METHOD AND APPARATUS FOR ALIASING
MEMORY DATA IN AN ADVANCED 
MICROPROCESSOR 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates to computer systems and, more 
particularly, to an improved microprocessor which utilizes 
methods and apparatus for storing frequently utilized 
memory data in registers for more rapid access. 

2. History of the Prior Art 

There are thousands of application programs which run on
computers designed around particular families of
microprocessors. The largest number of programs in existence
are designed to run on computers (generally referred to as
"IBM Compatible Personal Computers") using the "X86" family
of microprocessors (including the Intel® 8088, Intel 8086,
Intel 80186, Intel 80286, i386, i486, and progressing
through the various Pentium® microprocessors) designed
and manufactured by Intel Corporation of Santa Clara, Calif.
There are many other examples of programs designed to run
on computers using other families of processors. Because
there are so many application programs which run on these
computers, there is a large market for microprocessors
capable of use in such computers, especially computers
designed to process X86 programs. The microprocessor
market is not only large but also quite lucrative.

Although the market for microprocessors which are able
to run large numbers of application programs is large and
lucrative, it is quite difficult to design a new competitive
microprocessor. For example, even though the X86 family
of processors has been in existence for a number of years
and these processors are included in the majority of
computers sold and used, there are few successful competitive
microprocessors which are able to run X86 programs. The
reasons for this are many.

In order to be successful, a microprocessor must be able 
to run all of the programs (including operating systems and 
legacy programs) designed for that family of processors as 
fast as existing processors without costing more than exist- 
ing processors. In addition, to be economically successful, a 
new microprocessor must do at least one of these things 
better than existing processors to give buyers a reason to 
choose the new processor over existing proven processors. 

It is difficult and expensive to make a microprocessor run
as fast as state of the art microprocessors. Processors carry
out instructions through primitive operations such as
loading, shifting, adding, storing, and similar low level
operations and respond only to such primitive instructions in
executing any instruction furnished by an application
program. For example, a processor designed to run the
instructions of a complicated instruction set computer (CISC)
such as an X86, in which instructions may designate the
process to be carried out at a relatively high level, has
historically included read only memory (ROM) which stores
so-called micro-instructions. Each micro-instruction includes
a sequence of primitive instructions which when run in
succession bring about the result commanded by the high level
CISC instruction.

Typically, an "add A to B" CISC instruction is decoded to
cause a look up of an address in ROM at which a
micro-instruction for carrying out the functions of the
"add A to B" instruction is stored. The micro-instruction is
loaded, and its primitive instructions are run in sequence to
cause the "add A to B" instruction to be carried out. With
such a CISC computer, the primitive operations within a
micro-instruction can never be changed during program
execution. Each CISC instruction can only be run by decoding
the instruction, addressing and fetching the
micro-instruction, and running the sequence of primitive
operations in the order provided in the micro-instruction.
Each time the micro-instruction is run, the same sequence
must be followed.
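
The decode, look up, and run sequence described above can be sketched as follows. This is only an illustrative model and not circuitry taken from the specification: the table name MICROCODE_ROM, the opcode "ADD_A_TO_B", and the function execute are hypothetical names, and the "ROM" is modeled as a read-only mapping from an opcode to a fixed tuple of primitive operations.

```python
# Hypothetical micro-instruction ROM: each CISC opcode maps to a fixed,
# immutable sequence of primitive operations (all names are illustrative).
MICROCODE_ROM = {
    "ADD_A_TO_B": ("load", "add", "store"),
}


def execute(opcode, memory, addr_a, addr_b):
    """Decode one CISC instruction and run its primitive sequence in order.

    Returns the list of primitive operations that were run, so a caller
    can observe that the same sequence is followed on every execution.
    """
    temp = {}
    trace = []
    for primitive in MICROCODE_ROM[opcode]:  # look up, then run in order
        if primitive == "load":
            temp["a"], temp["b"] = memory[addr_a], memory[addr_b]
        elif primitive == "add":
            temp["sum"] = temp["a"] + temp["b"]
        elif primitive == "store":
            memory[addr_b] = temp["sum"]
        trace.append(primitive)
    return trace


memory = {0: 2, 1: 3}
first = execute("ADD_A_TO_B", memory, 0, 1)
second = execute("ADD_A_TO_B", memory, 0, 1)
```

Because the sequence is stored as a tuple, nothing in this model can alter it between the first and second call, which mirrors the statement that the primitive operations within a micro-instruction cannot be changed during program execution.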

State of the art processors for running X86 applications
utilize a number of techniques to provide the fastest
processing possible at a price which is still economically
reasonable. Any new processor which implements known
hardware techniques for accelerating the speed at which a
processor may run must increase the sophistication of the
processing hardware. This requires increasing the cost of the
hardware.

For example, a superscalar microprocessor which uses a
plurality of processing channels in order to execute two or
more operations at once has a number of additional
requirements. At the most basic level, a simple superscalar
microprocessor might decode each application instruction into
the micro-instructions which carry out the function of the
application instruction. Then, the simple superscalar
microprocessor schedules two micro-instructions to run
together if the two micro-instructions do not require the same
hardware resources and the execution of a micro-instruction
does not depend on the results of other micro-instructions
being processed.
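
The pairing rule of the simple superscalar case can be illustrated with a short check. This is a minimal sketch under stated assumptions: the MicroOp record, its field names, and the unit labels "mem" and "alu" are hypothetical, and the test for write/write and read/write overlap is an added conservative assumption beyond the result dependence named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    name: str
    unit: str            # hardware resource required (hypothetical labels)
    reads: frozenset     # registers read
    writes: frozenset    # registers written


def can_pair(first, second):
    """Return True if two micro-instructions may be issued together."""
    if first.unit == second.unit:
        return False     # both require the same hardware resource
    if first.writes & second.reads:
        return False     # second depends on the result of first
    if first.writes & second.writes or first.reads & second.writes:
        return False     # conservative assumption: no register overlap
    return True


load = MicroOp("load r1", "mem", frozenset({"r2"}), frozenset({"r1"}))
add = MicroOp("add r3", "alu", frozenset({"r4"}), frozenset({"r3"}))
use = MicroOp("add r5, r1", "alu", frozenset({"r1"}), frozenset({"r5"}))
```

Here can_pair(load, add) holds, while can_pair(load, use) does not because the second operation needs the value the first one produces.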

A more advanced superscalar microprocessor typically
decodes each application instruction into a series of
primitive instructions so that those primitive instructions
may be reordered and scheduled into the most efficient
execution order. This requires that each individual primitive
operation be addressed and fetched. To accomplish reordering,
the processor must be able to ensure that a primitive
instruction which requires data resulting from another
primitive instruction is run after that other primitive
instruction produces the needed data. Such a superscalar
microprocessor must assure that two primitive instructions
being run together do not both require the same hardware
resources. Such a processor must also resolve conditional
branches before the effects of branch operations can be
completed.
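
The reordering constraints above can be sketched as a small list scheduler. This is a simplified software model and not the hardware the text refers to: the function name schedule, the tuple layout, and the issue width of two are assumptions, every operation is assumed to complete in one cycle, and conditional branches are not modeled.

```python
def schedule(ops, width=2):
    """Group primitive operations into issue cycles.

    ops is a list of (name, unit, reads, writes) tuples in program order,
    where reads and writes are sets of register names. An operation is
    issued only after every earlier conflicting operation has completed,
    and no two operations in one cycle use the same hardware unit.
    """
    remaining = list(range(len(ops)))
    done = set()
    cycles = []
    while remaining:
        issued, used_units = [], set()
        for i in remaining:
            _, unit, reads, writes = ops[i]
            blocked = any(
                j not in done
                and (ops[j][3] & (reads | writes) or ops[j][2] & writes)
                for j in range(i)
            )
            if not blocked and unit not in used_units and len(issued) < width:
                issued.append(i)
                used_units.add(unit)
        done.update(issued)
        remaining = [i for i in remaining if i not in issued]
        cycles.append([ops[i][0] for i in issued])
    return cycles


program = [
    ("load r1", "mem", {"r9"}, {"r1"}),
    ("add r2=r1+1", "alu", {"r1"}, {"r2"}),
    ("load r3", "mem", {"r8"}, {"r3"}),
    ("mul r4=r5*r6", "alu", {"r5", "r6"}, {"r4"}),
]
cycles = schedule(program)
```

In this example the independent multiply is moved ahead of the add, which must wait for the first load, so the four operations fit in two cycles instead of four.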

Thus, superscalar microprocessors require extensive
hardware to compare the relationships of the primitive
instructions to one another and to reorder and schedule the
sequence of the primitive instructions to carry out any
instruction. As the number of processing channels increases,
the amount and cost of the hardware to accomplish these
superscalar acceleration techniques increases approximately
quadratically. All of these hardware requirements increase
the complexity and cost of the circuitry involved. As in
dealing with micro-instructions, each time an application
instruction is executed, a superscalar microprocessor must
use its relatively complicated addressing and fetching
hardware to fetch each of these primitive instructions, must
reorder and reschedule these primitive instructions based on
the other primitive instructions and hardware usage, and
then must execute all of the rescheduled primitive
instructions. The need to run each application instruction
through the entire hardware sequence each time it is executed
limits the speed at which a superscalar processor is capable
of executing its instructions.
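
One way to see why the comparison hardware grows roughly quadratically is to count the pairs of primitive instructions that must be checked against one another. The text states only the growth rate, so the model below is an assumption: each pair among n candidate instructions is compared once, giving n(n-1)/2 comparisons. The function name pairwise_checks is hypothetical.

```python
from itertools import combinations


def pairwise_checks(window):
    """Count the distinct pairs among `window` candidate instructions."""
    return sum(1 for _ in combinations(range(window), 2))


# Doubling the number of candidates roughly quadruples the comparisons.
counts = {n: pairwise_checks(n) for n in (2, 4, 8)}
```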

Moreover, even though these various hardware techniques
increase the speed of processing, the complexity
involved in providing such hardware significantly increases
the cost of such a microprocessor. For example, the Intel
i486 DX4 processor uses approximately 1.5 million
transistors. Adding the hardware required to accomplish the
checking of dependencies and scheduling necessary to process
instructions through two channels in a basic superscalar
microprocessor such as the Intel Pentium® requires the use
of more than three million transistors. Adding the hardware
to allow reordering among primitive instructions derived
from different target instructions, provide speculative
execution, allow register renaming, and provide branch
prediction increases the number of transistors to over six
million in the Intel Pentium Pro™ microprocessor. Thus, it
can be seen that each hardware addition to increase
operation speed has drastically increased the number of
transistors in the latest state of the art microprocessors.

Even using these known techniques may not produce a
microprocessor faster than existing microprocessors because
manufacturers use most of the economically feasible
techniques known to accelerate the operation of existing
microprocessors. Consequently, designing a faster processor is
a very difficult and expensive task. Reducing the cost of a
processor is also very difficult. As illustrated above,
hardware acceleration techniques which produce a sufficiently
capable processor are very expensive. One designing a new
processor must obtain the facilities to produce the hardware.
Such facilities are very difficult to obtain because chip
manufacturers do not typically spend assets on small runs of
devices. The capital investment required to produce a chip
manufacturing facility is so great that it is beyond the reach
of most companies.

Even though one is able to design a new processor which
runs all of the application programs designed for a family of
processors at least as fast as competitive processors, the
price of competitive processors includes sufficient profit that
substantial price reductions are sure to be faced by any
competitor.

Although designing a competitive processor by increasing
the complexity of the hardware is very difficult, another way
to run application programs (target application programs)
designed for a particular family of microprocessors (target
microprocessors) has been to emulate the target
microprocessor in software on another faster microprocessor
(host microprocessor). This is an incrementally inexpensive
method of running these programs because it requires only
the addition of some form of emulation software which
enables the application program to run on a faster
microprocessor. The emulator software changes the target
instructions of an application program written for the target
processor family into host instructions capable of execution
by the host microprocessor. These changed instructions are
then run under control of the operating system on the faster
host microprocessor.

There have been a number of different designs by which
target applications may be run on host computers with faster
processors than the processors of target computers. In
general, the host computers executing target programs using
emulation software utilize reduced instruction set (RISC)
microprocessors because RISC processors are theoretically
simpler and consequently can run faster than other types of
processors.

However, even though RISC computer systems running
emulator software are often capable of running X86 (or
other) programs, they usually do so at a rate which is
substantially slower than the rate at which state of the art
X86 computer systems run the same programs. Moreover,
often these emulator programs are not able to run all or a
large number of the target programs available.

The reasons why emulator programs are not able to run
target programs as rapidly as the target microprocessors are
quite complicated and require some understanding of the
different emulation operations. FIG. 1 includes a series of
diagrams representing the different ways in which a plurality
of different types of microprocessors execute target
application programs.

In FIG. 1(a), a typical CISC microprocessor such as an
Intel X86 microprocessor is shown running a target
application program which is designed to be run on that target
processor. As may be seen, the application is run on the
CISC processor using a CISC operating system (such as MS
DOS, Windows 3.1, Windows NT, and OS/2 which are used
with X86 computers) designed to provide interfaces by
which access to the hardware of the computer may be
gained. Typically, the instructions of the application
program are selected to utilize the devices of the computer
only through the access provided by the operating system.
Thus, the operating system handles the manipulations which
allow applications access to memory and to the various
input/output devices of the computer. The target computer
includes memory and hardware which the operating system
recognizes, and a call to the operating system from a target
application causes an operating system device driver to
cause an expected operation to occur with a defined device
of the target computer. The instructions of the application
execute on the processor where they are changed into
operations (embodied in microcode or the more primitive
operations from which microcode is assembled) which the
processor is capable of executing. As has been described
above, each time a complicated target instruction is
executed, the instruction calls the same subroutine stored as
microcode (or as the same set of primitive operations). The
same subroutine is always executed. If the processor is a
superscalar, these primitive operations for carrying out a
target instruction can often be reordered by the processor,
rescheduled, and executed using the various processing
channels in the manner described above; however, the
subroutine is still fetched and executed.

In FIG. 1(b), a typical RISC microprocessor such as a
PowerPC microprocessor used in an Apple Macintosh
computer is represented running the same target application
program which is designed to be run on the CISC processor
of FIG. 1(a). As may be seen, the target application is run on
the host processor using at least a partial target operating
system to respond to a portion of the calls which the target
application generates. Typically these are calls to the
application-like portions of the target operating system used
to provide graphical interfaces on the display and short
utility programs which are generally application-like. The
target application and these portions of the target operating
system are changed by a software emulator such as Soft
PC® which breaks the instructions furnished by the target
application program and the application-like target
operating system programs into instructions which the host
processor and its host operating system are capable of
executing. The host operating system provides the interfaces
through which access to the memory and input/output
hardware of the RISC computer may be gained.

However, the host RISC processor and the hardware
devices associated with it in a host RISC computer are
usually quite different than are the devices associated with
the processor for which the target application was designed;
and the various instructions provided by the target
application program are designed to cooperate with the device
drivers of the target operating system in accessing the
various portions of the target computer. Consequently, the
emulation program, which changes the instructions of the
target application program to primitive host instructions
which the host operating system is capable of utilizing, must
somehow link the operations designed to operate hardware
devices in the target computer to operations which hardware
devices of the host system are capable of implementing.
Often this requires the emulator software to create virtual
devices which respond to the instructions of the target
application to carry out operations which the host system is
incapable of carrying out because the target devices are not
those of the host computer. Sometimes the emulator is
required to create links from these virtual devices through
the host operating system to host hardware devices which
are present but are addressed in a different manner by the
host operating system.

Target programs when executed in this manner run
relatively slowly for a number of reasons. First, each target
instruction from a target application program and from the
target operating system must be changed by the emulator
into the host primitive functions used by the host processor.
If the target application is designed for a CISC machine such
as an X86, the target instructions are of varying lengths and
quite complicated so that changing them to host primitive
instructions is quite involved. The original target
instructions are first decoded, and the sequence of primitive
host instructions which make up the target instructions are
determined. Then the address (or addresses) of each sequence
of primitive host instructions is determined, each sequence of
the primitive host instructions is fetched, and these primitive
host instructions are executed in or out of order. The large
number of extra steps required by an emulator to change the
target application and operating system instructions into host
instructions understood by the host processor must be
conducted each time an instruction is executed and slows the
process of emulation.

Second, many target instructions include references to
operations conducted by particular hardware devices which
function in a particular manner in the target computer,
hardware which is not available in the host computer. To
carry out the operation, the emulation software must either
make software connections to the hardware devices of the
host computer through the existing host operating system or
the emulator software must furnish a virtual hardware
device. Emulating the hardware of another computer in
software is very difficult. The emulation software must
generate virtual devices for each of the target application
calls to the host operating system; and each of these virtual
devices must provide calls to the actual host devices.
Emulating a hardware device requires that when a target
instruction is to use the device, the code representing the
virtual device required by that instruction be fetched from
memory and run to implement the device. Either of these
methods of solving the problem adds another series of
operations to the execution of the sequence of instructions.

Complicating the problem of emulation is the requirement
that the target application take various exceptions which are
carried out by hardware of the target computer and the target
operating system in order for the computer system to
operate. When a target exception is taken during the operation
of a target computer, state of the computer at the time of the
exception must be saved typically by calling a microcode
sequence to accomplish the operation, the correct exception
handler must be retrieved, the exception must be handled,
then the correct point in the program must be found for
continuing with the program. Sometimes this requires that
the program revert to the state of the target computer at the
point the exception was taken, and at other times a branch
provided by the exception handler is taken. In any case, the
hardware and software of the target computer required to
accomplish these operations must somehow be provided in
the process of emulation.

Because the correct target state must be available at the
time of any such exception for proper execution, the
emulator is forced to keep accurate track of this state at all
times so that it is able to correctly respond to these
exceptions. In the prior art, this has required executing each
instruction in the order provided by the target application
because only in this way could correct target state be
maintained.

Moreover, prior art emulators have always been required
to maintain the order of execution of the target application
for other reasons. Target instructions can be of two types,
ones which affect memory or ones which affect a memory
mapped input/output (I/O) device. There is no way to know
without attempting to execute an instruction whether an
operation is to affect memory or a memory-mapped I/O
device. When instructions operate on memory, optimizing
and reordering is possible and greatly aids in speeding the
operation of a system. However, operations affecting I/O
devices often must be practiced in the precise order in which
those operations are programmed without the elimination of
any steps or there may be some adverse effect on the
operation of the I/O device. For example, a particular I/O
operation may have the effect of clearing an I/O register. If
the operations take place out of order so that a register is
cleared of a value which is still necessary, then the result of
the operation may be different than the operation
commanded by the target instruction. Without a means to
distinguish memory from memory mapped I/O, it is necessary
to treat all instructions as though they affect memory
mapped I/O. This severely restricts the nature of
optimizations achievable. Because prior art emulators lack
both means to detect the nature of the memory being
addressed and means to recover from such failures, they are
required to proceed sequentially through the target
instructions as though each operation affects memory mapped
I/O. This greatly limits the possibility of optimizing the host
code.

Another problem which limits the ability of prior art
emulators to optimize the host code is caused by
self-modifying code. If a target instruction has been changed
to a sequence of host instructions which in turn write back to
change the original target instruction, then the host
instructions are no longer valid. Consequently, the emulator
must constantly check to determine whether a store is to the
target code area. All of these problems make this type of
emulation much slower than running a target application on a
target processor.

Another example of the type of emulation software shown
in FIG. 1(b) is described in an article entitled, "Talisman:
Fast and Accurate Multicomputer Simulation," R. C.
Bedichek, Laboratory for Computer Sciences, Massachusetts
Institute of Technology. This is a more complete
example of translation in that it can emulate a complete
research system and run the research target operating
system. Talisman uses a host UNIX operating system.

In FIG. 1(c), another example of emulation is shown. In
this case, a PowerPC microprocessor used in an Apple
Macintosh computer is represented running a target
application program which was designed to be run on the
Motorola 68000 family CISC processors used in the original
Macintosh computers; this type of arrangement has been
required in order to allow Apple legacy programs to run on
the Macintosh computers with RISC processors. As may be
seen, the target application is run on the host processor using
at least a partial target operating system to respond to the
application-like portions of the target operating system. A
software emulator breaks the instructions furnished by the
target application program and the application-like target
operating system programs into instructions which the host
processor and its host operating system are capable of
executing. The host operating system provides the interfaces
through which access to the memory and input/output
hardware of the host computer may be gained.

Again, the host RISC processor and the devices
associated with it in the host RISC computer are quite different
than are the devices associated with the Motorola CISC
processor; and the various target instructions are designed to
cooperate with the target CISC operating system in
accessing the various portions of the target computer.
Consequently, the emulation program must link the
operations designed to operate hardware devices in the target
computer to operations which hardware devices of the host
system are capable of implementing. This requires the
emulator to create software virtual devices which respond to
the instructions of the target application and to create links
from these virtual devices through the host operating system
to host hardware devices which are present but are addressed
in a different manner by the host operating system.

The target software run in this manner runs relatively
slowly for the same reasons that the emulation of FIG. 1(b)
runs slowly. First, each target instruction from the target
application and from the target operating system must be
changed by fetching the instruction; and all of the host
primitive functions derived from that instruction must be run
in sequence each time the instruction is executed. Second,
the emulation software must generate virtual devices for
each of the target application calls to the host operating
system; and each of these virtual devices must provide calls
to the actual host devices. Third, the emulator must treat all
instructions as conservatively as it treats instructions which
are directed to memory mapped I/O devices or risk
generating exceptions from which it cannot recover. Finally,
the emulator must maintain the correct target state at all
times and store operations must always check ahead to
determine whether a store is to the target code area. All of
these requirements eliminate the ability of the emulator to
practice significant optimization of the code run on the host
processor and make this type of emulation much slower than
running the target application on a target processor.
Emulation rates less than one-quarter as fast as state of the
art processors are considered very good. In general, this has
relegated this type of emulation software to uses where the
capability of running applications designed for another
processor is useful but not primary.

In FIG. 1(d), a particular method of emulating a target
application program on a host processor which provides
relatively good performance for a very limited series of
target applications is illustrated. The target application
furnishes instructions to an emulator which changes those
instructions into instructions for the host processor and the
host operating system. The host processor is a Digital
Equipment Corporation Alpha RISC processor, and the host
operating system is Microsoft NT. The only target
applications which may be run by this system are 32 bit
applications designed to be executed by a target X86
processor with a Windows WIN32S compliant operating system.
Since the host and target operating systems are almost
identical, being designed to handle these same instructions,
the emulator software may change the instructions very
easily. Moreover, the host operating system is already
designed to respond to the same calls that the target
application generates so that the generation of virtual devices
is considerably reduced.

Although this is technically an emulation system running
a target application on a host processor, it is a very special
case. Here the emulation software is running on a host
operating system already designed to run similar
applications. This allows the calls from the target applications
to be more simply directed to the correct facilities of the host
and the host operating system. More importantly, this system
will run only 32 bit Windows applications which probably
amount to less than one percent of all X86 applications.
Moreover, this system will run applications on only one
operating system, Windows NT; while X86 processors run
applications designed for a large number of operating
systems. Such a system, therefore, could be considered not to
be compatible within the terms expressed earlier in this
specification. Thus, a processor running such an emulator
cannot be considered to be a competitive X86 processor.

Another method of emulation by which software may be
used to run portions of applications written for a first
instruction set on a computer which recognizes a different
instruction set is illustrated in FIG. 1(e). This form of
emulation software is typically utilized by a programmer
who may be porting an application from one computer
system to another. Typically, the target application is being
designed for some target computer other than the host
machine on which the emulator is being run. The emulator
software analyzes the target instructions, translates those
instructions into instructions which may be run on the host
machine, and caches those host instructions so that they may
be reused. This dynamic translation and caching allows
portions of applications to be run very rapidly. This form of
emulator is normally used with software tracing tools to
provide detailed information about the behavior of a target
program being run. The output of a tracing tool may, in turn,
be used to drive an analyzer program which analyzes the
trace information.

In order to determine how the code actually functions, an
emulator of this type, among other things, runs with the host
operating system on the host machine, furnishes the virtual
hardware which the host operating system does not provide,
and otherwise maps the operations of the computer for
which the application was designed to the hardware
resources of the host machine in order to carry out the
operations of the program being run. This software
virtualizing of hardware and mapping to the host computer can
be very slow and incomplete.

Moreover, because it often requires a plurality of host
instructions to carry out one of the target instructions,
exceptions including faults and traps which require a target
operating system exception handler may be generated and
cause the host to cease processing the host instructions at a
point unrelated to target instruction boundaries. When this
happens, it may be impossible to handle the exception
correctly because the state of the host processor and memory
is incorrect. If this is the case, the emulator must be stopped
and rerun to trace the operations which generated the
exception. Thus, even though such an emulator may run
sequences of target code very rapidly, it has no method for
recovering from these exceptions so cannot run any
significant portion of an application rapidly.

This is not a particular problem with this form of emulator
because the functions being performed by the emulators,
tracers, and the associated analyzers are directed to
generating new programs or porting old programs to another
machine so that the speed at which the emulator software
runs is rarely at issue. That is, a programmer is usually not interested in how fast the code produced by an emulator runs on the host machine but in whether the emulator produces code which is executable on the machine for which it is designed and which will run rapidly on that machine. Consequently, this type of emulation software does not provide a method for running application programs written in a first instruction set to run on a different type of microprocessor for other than programming purposes. An example of this type of emulation software is described in an article entitled, "Shade: A Fast Instruction-Set Simulator for Execution Profiling," Cmelik and Keppel.

It is desirable to provide competitive microprocessors which are faster and less expensive than state of the art microprocessors yet are entirely compatible with target application programs designed for state of the art microprocessors running any operating systems available for those microprocessors. More particularly, it is desirable to provide a host processor having circuitry for enhancing the speed at which the processor functions.

SUMMARY OF THE INVENTION

It is, therefore, an object of the present invention to enhance the operation of a microprocessor with apparatus for accelerating the execution of programs.

This and other objects of the present invention are realized by apparatus and a method for storing data already stored at an often utilized memory address in registers local to a host processor so that the processor may respond more rapidly when a memory address is to be accessed.

These and other objects and features of the invention will be better understood by reference to the detailed description which follows taken together with the drawings in which like elements are referred to by like designations throughout the several views.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1(a), 1(b), 1(c), 1(d), and 1(e) are diagrams illustrating the manner of operation of microprocessors designed in accordance with the prior art.

FIG. 2 is a block diagram of a microprocessor designed in accordance with the present invention running an application designed for a different microprocessor.

FIG. 3 is a diagram illustrating a portion of the microprocessor shown in FIG. 2.

FIG. 4 is a block diagram illustrating a register file used in a microprocessor designed in accordance with the present invention.

FIG. 5 is a block diagram illustrating a gated store buffer designed in accordance with the present invention.

FIGS. 6(a), 6(b), and 6(c) illustrate instructions used in various microprocessors of the prior art and in a microprocessor designed in accordance with the present invention.

FIG. 7 illustrates a method practiced by a software portion of a microprocessor designed in accordance with the present invention.

FIG. 8 illustrates another method practiced by a software portion of a microprocessor designed in accordance with the present invention.

FIG. 9 is a block diagram illustrating an improved computer system including the present invention.

FIG. 10 is a block diagram illustrating a portion of the microprocessor shown in FIG. 3.

FIG. 11 is a block diagram illustrating in more detail a translation look aside buffer shown in the microprocessor of FIG. 3.

FIG. 12 is a block diagram illustrating in detail memory aliasing circuitry in accordance with the present invention.

NOTATION AND NOMENCLATURE

Some portions of the detailed descriptions which follow are presented in terms of symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.

Further, the manipulations performed are often referred to in terms, such as adding or comparing, which are commonly associated with mental operations performed by a human operator. No such capability of a human operator is necessary or desirable in most cases in any of the operations described herein which form part of the present invention; the operations are machine operations. Useful machines for performing the operations of the present invention include general purpose digital computers or other similar devices. In all cases the distinction between the method operations in operating a computer and the method of computation itself should be borne in mind. The present invention relates to a method and apparatus for operating a computer in processing electrical or other (e.g. mechanical, chemical) physical signals to generate other desired physical signals.

During the following description, in some cases the target program is referred to as a program which is designed to be executed on an X86 microprocessor in order to provide exemplary details of operation because the majority of emulators run X86 applications. However, the target program may be one designed to run on any family of target computers. This includes target virtual computers, such as Pcode machines, Postscript machines, or Java virtual machines.

DETAILED DESCRIPTION

The present invention helps overcome the problems of the prior art and provides a microprocessor which is faster than microprocessors of the prior art, is capable of running all of the software for all of the operating systems which may be run by a large number of families of prior art microprocessors, yet is less expensive than prior art microprocessors.

Rather than using a microprocessor with more complicated hardware to accelerate its operation, the present invention is a part of a combination including an enhanced hardware processing portion (referred to as a "morph host" in this specification) which is much simpler than state of the art microprocessors and an emulating software portion (referred to as "code morphing software" in this specification) in a manner that the two portions function together as a microprocessor with more capabilities than any known competitive microprocessor. More particularly, a morph host is a processor which includes hardware enhancements to assist in having state of a target computer imme-
diately at hand when an exception or error occurs, while code morphing software is software which translates the instructions of a target program to morph host instructions for the morph host and responds to exceptions and errors by replacing working state with correct target state when necessary so that correct retranslations occur. Code morphing software may also include various processes for enhancing the speed of processing. Rather than providing hardware to enhance the speed of processing as do all of the very fast prior art microprocessors, the improved microprocessor allows a large number of acceleration enhancement techniques to be carried out in selectable stages by the code morphing software. Providing the speed enhancement techniques in the code morphing software allows the morph host to be implemented using much less complicated hardware which is faster and substantially less expensive than the hardware of prior art microprocessors. As a comparison, one embodiment including the present invention designed to run all available X86 applications is implemented by a morph host including approximately one-quarter of the number of gates of the Pentium Pro microprocessor yet runs X86 applications substantially faster than does the Pentium Pro microprocessor or any other known microprocessor capable of processing these applications.

The code morphing software utilizes certain techniques which have previously been used only by programmers designing new software or emulating new hardware. The morph host includes hardware enhancements especially adapted to allow the acceleration techniques provided by the code morphing software to be utilized efficiently. These hardware enhancements allow the code morphing software to implement acceleration techniques over a broader range of instructions. These hardware enhancements also permit additional acceleration techniques to be practiced by the code morphing software which are unavailable in hardware processors and could not be implemented in those processors except at exorbitant cost. These techniques significantly increase the speed of the microprocessor which includes the present invention compared to the speeds of prior art microprocessors practicing the execution of native instruction sets.

For example, the code morphing software combined with the enhanced morph host allows the use of techniques which allow the reordering and rescheduling of primitive instructions generated by a sequence of target instructions without requiring the addition of significant circuitry. By allowing the reordering and rescheduling of a number of target instructions together, other optimization techniques can be used to reduce the number of processor steps which are necessary to carry out a group of target instructions to fewer than those required by any other microprocessors which will run the target applications.

The code morphing software combined with the enhanced morph host translates target instructions into instructions for the morph host on the fly and caches those host instructions in a memory data structure (referred to in this specification as a "translation buffer"). The use of a translation buffer to hold translated instructions allows instructions to be recalled without rerunning the lengthy process of determining which primitive instructions are required to implement each target instruction, addressing each primitive instruction, fetching each primitive instruction, optimizing the sequence of primitive instructions, allocating assets to each primitive instruction, reordering the primitive instructions, and executing each step of each sequence of primitive instructions involved each time each target instruction is executed. Once a target instruction has been translated, it may be recalled from the translation buffer and executed without the need for any of these myriad of steps.

A primary problem of prior art emulation techniques has been the inability of these techniques to handle with good performance exceptions generated during the execution of a target program. This is especially true of exceptions generated in running the target application which are directed to the target operating system where the correct target state must be available at the time of any such exception for proper execution of the exception and the instructions which follow. Consequently, the emulator is forced to keep accurate track of the target state at all times and must constantly check to determine whether a store is to the target code area. Other exceptions create similar problems. For example, exceptions can be generated by the emulator to detect particular target operations which have been replaced by some particular host function. In particular, various hardware operations of a target processor may be replaced by software operations provided by the emulator software. Additionally, the host processor executing the host instructions derived from the target instructions can also generate exceptions. All of these exceptions can occur either during the attempt to change target instructions into host instructions by the emulator, or when the host translations are executed on the host processor. An efficient emulation must provide some manner of recovering from these exceptions efficiently and in a manner that the exception may be correctly handled. None of the prior art does this for all software which might be emulated.

In order to overcome these limitations of the prior art, a number of hardware improvements are included in the enhanced morph host. These improvements include a gated store buffer and a large plurality of additional processor registers. Some of the additional registers allow the use of register renaming to lessen the problem of instructions needing the same hardware resources. The additional registers also allow the maintenance of a set of host or working registers for processing the host instructions and a set of target registers to hold the official state of the target processor for which the target application was created. The target (or shadow) registers are connected to their working register equivalents through a dedicated interface that allows an operation called "commit" to quickly transfer the content of all working registers to official target registers and allows an operation called "rollback" to quickly transfer the content of all official target registers back to their working register equivalents. The gated store buffer stores working memory state changes on an "uncommitted" side of a hardware "gate" and official memory state changes on a "committed" side of the hardware gate where these committed stores "drain" to main memory. A commit operation transfers stores from the uncommitted side of the gate to the committed side of the gate. The additional official registers and the gated store buffer allow the state of memory and the state of the target registers to be updated together once one or a group of target instructions have been translated and run without error.

These updates are chosen by the code morphing software to occur on integral target instruction boundaries. Thus, if the primitive host instructions making up a translation of a series of target instructions are run by the host processor without generating exceptions, then the working memory stores and working register state generated by those instructions are transferred to official memory and to the official target registers. In this manner, if an exception occurs when processing the host instructions at a point which is not on the boundary of one or a set of target instructions being
translated, the original state in the target registers at the last update (or commit) may be recalled to the working registers and uncommitted memory stores in the gated store buffer may be dumped. Then, for the case where the exception generated is a target exception, the target instructions causing the target exception may be retranslated one at a time and executed in serial sequence as they would be executed by a target microprocessor. As each target instruction is correctly executed without error, the state of the target registers may be updated; and the data in the store buffer gated to memory. Then, when the exception occurs again in running the host instructions, the correct state of the target computer is held by the target registers of the morph host and memory; and the operation may be correctly handled without delay. Each new translation generated by this corrective translating may be cached for future use as it is translated or alternatively dumped for a one time or rare occurrence such as a page fault. This allows the microprocessor created by the combination of the code morphing software and the morph host to execute the instructions more rapidly than processors for which the software was originally written.

It should be noted that in executing target programs using the microprocessor including the present invention, many different types of exceptions can occur which are handled in different manners. For example, some exceptions are caused by the target software generating an exception which utilizes a target operating system exception handler. The use of such an exception handler requires that the code morphing software include routines for emulating the entire exception handling process including any hardware provided by the target computer for handling the process. This requires that the code morphing software provide for saving the state of the target processor so that it may proceed correctly after the exception has been handled. Some exceptions like a page fault, which requires fetching data in a new page of memory before the process being translated may be implemented, require a return to the beginning of the process being translated after the exception has been handled. Other exceptions implement a particular operation in software where that operation is not provided by the hardware. These require that the exception handler return the operation to the next step in the translation after the exception has been handled. Each of these different types of exceptions may be efficiently handled by the microprocessor including the present invention.

Additionally, some exceptions are generated by host hardware and detect a variety of host and target conditions. Some exceptions behave like exceptions on a conventional microprocessor, but others are used by the code morphing software to detect failure of various speculations. In these cases, the code morphing software, using the state saving and restoring mechanisms described above, causes the target state to be restored to its most recent official version and generates and saves a new translation (or re-uses a previously generated safe translation) which avoids the failed speculation. This translation is then executed.

The morph host includes additional hardware exception detection mechanisms that in conjunction with the rollback and retranslate method described above allow further optimization. Examples are a means to distinguish memory from memory mapped I/O and a means to eliminate memory references by protecting addresses or address ranges thus allowing target variables to be kept in registers.

For the case where exceptions are used to detect failure of other speculations, such as whether an operation affects memory or memory mapped I/O, recovery is accomplished by the generation of new translations with different memory operations and different optimizations.

FIG. 2 is a diagram of morph host hardware represented running the same application program which is being run on the CISC processor of FIG. 1(a). As may be seen, the microprocessor includes the code morphing software portion and the enhanced hardware morph host portion described above. The target application furnishes the target instructions to the code morphing software for translation into host instructions which the morph host is capable of executing. In the meantime, the target operating system receives calls from the target application program and transfers these to the code morphing software. In a preferred embodiment of the microprocessor, the morph host is a very long instruction word (VLIW) processor which is designed with a plurality of processing channels. The overall operation of such a processor is further illustrated in FIG. 6(c).

In FIG. 6(a)-(c) are illustrated instructions adapted for use with each of a CISC processor, a RISC processor, and a VLIW processor. As may be seen, the CISC instructions are of varied lengths and may include a plurality of more primitive operations (e.g., load and add). The RISC instructions, on the other hand, are of equal length and are essentially primitive operations. The single very long instruction for the VLIW processor illustrated includes each of the more primitive operations (i.e., load, store, integer add, compare, floating point multiply, and branch) of the CISC and RISC instructions. As may be seen in FIG. 6(c), each of the primitive instructions which together make up a single very long instruction word is furnished in parallel with the other primitive instructions either to one of a plurality of separate processing channels of the VLIW processor or to memory to be dealt with in parallel by the processing channels and memory. The results of all of these parallel operations are transferred into a multiported register file.

A VLIW processor which may be the basis of the morph host is a much simpler processor than the other processors described above. It does not include circuitry to detect issue dependencies or to reorder, optimize, and reschedule primitive instructions. This, in turn, allows faster processing at higher clock rates than is possible with either the processors for which the target application programs were originally designed or other processors using emulation programs to run target application programs. However, the processor is not limited to VLIW processors and may function as well with any type of processor such as a RISC processor.

The code morphing software of the microprocessor shown in FIG. 2 includes a translator portion which decodes the instructions of the target application, converts those target instructions to the primitive host instructions capable of execution by the morph host, optimizes the operations required by the target instructions, reorders and schedules the primitive instructions into VLIW instructions (a translation) for the morph host, and executes the host VLIW instructions. The operations of the translator are illustrated in FIG. 7 which illustrates the operation of the main loop of the code morphing software.

In order to accelerate the operation of the microprocessor which includes the code morphing software and the enhanced morph host hardware, the code morphing software includes a translation buffer as is illustrated in FIG. 2. The translation buffer of one embodiment is a software data structure which may be stored in memory; a hardware cache might also be utilized in a particular embodiment. The translation buffer is used to store the host instructions which embody each completed translation of the target instructions. As may be seen, once the individual target instructions have been translated and the resulting host instructions have
been optimized, reordered, and rescheduled, the resulting host translation is stored in the translation buffer. The host instructions which make up the translation are then executed by the morph host. If the host instructions are executed without generating an exception, the translation may thereafter be recalled whenever the operations required by the target instruction or instructions are required.

Thus, as shown in FIG. 7, a typical operation of the code morphing software of the microprocessor when furnished the address of a target instruction by the application program is to first determine whether the target instruction at the target address has been translated. If the target instruction has not been translated, it and subsequent target instructions are fetched, decoded, translated, and then (possibly) optimized, reordered, and rescheduled into a new host translation, and stored in the translation buffer by the translator. As will be seen later, there are various degrees of optimization which are possible. The term optimization is often used generically in this specification to refer to those techniques by which processing is accelerated. For example, reordering is one form of optimization which allows faster processing and which is included within the term. Many of the optimizations which are possible have been described within the prior art of compiler optimizations, and some optimizations which were difficult to perform within the prior art like "super-blocks" come from VLIW research. Control is then transferred to the translation to cause execution by the enhanced morph host hardware to resume.

When the particular target instruction sequence is next encountered in running the application, the host translation will then be found in the translation buffer and immediately executed without the necessity of translating, optimizing, reordering, or rescheduling. Using the advanced techniques described below, it has been estimated that the translation for a target instruction (once completely translated) will be found in the translation buffer all but once for each one million or so executions of the translation. Consequently, after a first translation, all of the steps required for translation such as decoding, fetching primitive instructions, optimizing the primitive instructions, rescheduling into a host translation, and storing in the translation buffer may be eliminated from the processing required. Since the processor for which the target instructions were written must decode, fetch, reorder, and reschedule each instruction each time the instruction is executed, this drastically reduces the work required for executing the target instructions and increases the speed of the improved microprocessor.

In eliminating all of these steps required in execution of a target application by prior art processors, the microprocessor including the present invention overcomes problems of the prior art which made such operations impossible at any reasonable speed. For example, some of the techniques of the improved microprocessor were used in the emulators described above used for porting applications to other systems. However, some of these emulators had no way of running more than short portions of applications because in processing translated instructions, exceptions which generate calls to various system exception handlers were generated at points in the operation at which the state of the host processor had no relation to the state of a target processor processing the same instructions. Because of this, the state of the target processor at the point at which such an exception was generated was not known. Thus, correct state of the target machine could not be determined; and the operation would have to be stopped, restarted, and the correct state ascertained before the exception could be serviced and execution continued. This made running an application program at host speed impossible.

The morph host hardware includes a number of enhancements which overcome this problem. These enhancements are each illustrated in FIGS. 3, 4, and 5. In order to determine the correct state of the registers at the time an error occurs, a set of official target registers is provided by the enhanced hardware to hold the state of the registers of the target processor for which the original application was designed. These target registers may be included in each of the floating point units, any integer units, and any other execution units. These official registers have been added to the morph host along with an increased number of normal working registers so that a number of optimizations including register renaming may be practiced. One embodiment of the enhanced hardware includes sixty-four working registers in the integer unit and thirty-two working registers in the floating point unit. The embodiment also includes an enhanced set of target registers which include all of the frequently changed registers of the target processor necessary to provide the state of that processor; these include condition control registers and other registers necessary for control of the simulated system.

It should be noted that depending on the type of enhanced processing hardware utilized by the morph host, a translated instruction sequence may include primitive operations which constitute a plurality of target instructions from the original application. For example, a VLIW microprocessor may be capable of running a plurality of either CISC or RISC instructions at once as is illustrated in FIG. 6(a)-(c). Whatever the morph host type, the state of the target registers of the morph host hardware is not changed except at an integral target instruction boundary; and then all target registers are updated. Thus, if the microprocessor is executing a target instruction or instructions which have been translated into a series of primitive instructions which may have been reordered and rescheduled into a host translation, when the processor begins executing the translated instruction sequence, the official target registers hold the values which would be held by the registers of the target processor for which the application was designed when the first target instruction was addressed. After the morph host has begun executing the translated instructions, however, the working registers hold values determined by the primitive operations of the translated instructions executed to that point. Thus, while some of these working registers may hold values which are identical to those in the official target registers, others of the working registers hold values which are meaningless to the target processor. This is especially true in an embodiment which provides many more registers than does a particular target machine in order to allow advanced acceleration techniques. Once the translated host instructions begin, the values in the working registers are whatever those translated host instructions determine the condition of those registers to be. If a set of translated host instructions is executed without generating an exception, then the new working register values determined at the end of the set of instructions are transferred together to the official target registers (possibly including a target instruction pointer register). In the present embodiment of the processor, this transfer occurs outside of the execution of the host instructions in an additional pipeline stage so it does not slow operation of the morph host.

In a similar manner, a gated store buffer such as that illustrated in FIG. 5 is utilized in the hardware of the improved microprocessor to control the transfer of data to memory. The gated store buffer includes a number of elements each of which may hold the address and data for a memory store operation. These elements may be imple-
mented by any of a number of different hardware arrangements run, the next primitive function is run.) This continues until

(e.g., first-in first-out buffers); the embodiment illustrated an exception re-occurs or the single target instruction has

is implemented utilizing random access memory and been translated and executed. In one embodiment, if a

three dedicated working registers. The three registers store, translation of a target instruction is executed without an

respectively, a pointer to the head of the queue of memory exception being generated, then the state of working registers

stores, a pointer to the gate, and a pointer to the tail of the is transferred to the target registers and any data in the

queue of the memory stores. Memory stores positioned gated store buffer is committed so that it may be transferred 

between the head of the queue and the gate are already to memory. However, if an exception re-occurs during the 

committed to memory, while those positioned between the running of a translation, then the state of the target registers 

gate of the queue and the tail are not yet committed to and memory has not changed but is identical to the state 

memory. Memory stores generated during execution of host produced in a target computer when the exception occurs, 

translations are placed in the store buffer by the integer unit Consequently, when the target exception is generated, the 

in the order generated during the execution of the host exception will be correctly handled by the target operating 

instructions by the morph host but are not allowed to be system. 

written to memory until a commit operation is encountered Similarly, once a first target instruction of the series of

in a host instruction. Thus, as translations execute, the store instructions the translation of which generated an exception 

operations are placed in the queue. Assuming these are the has been executed without generating an exception, the 

first stores so that no other stores are in the gated store buffer, target instruction pointer points to the next of the target 

both the head and gate pointers will point to the same instructions. This second target instruction is decoded and 

position. As each store is executed, it is placed in the next retranslated without optimizing or reordering in the same 

position in the queue and the tail point is incremented to the manner as the first. As each of the host translations of a 

next position (upward in the figure). This continues until a single target instruction is processed by the morph host, any 

commit command is executed. This will normally happen exception generated wiU occur when the state of the target 

when the translation of a set of target instructions has been registers and memory is identical to the state which would 

completed without generating an exception or a error exit 25 occur in the target computer. Consequently, the exception 

condition. When a translation has been executed by the may be immediately and correctly handled. These new 

morph host without error, then the memory stores in the translations may be stored in the translation buffer as the 

store buffer generated during execution are moved together correct translations for that sequence of instructions in the 

past the gate of the store buffer (committed) and subse- target application and recalled whenever the instructions are 

quently written to memory. In the embodiment illustrated, 3Q remn. 

this is accomphshed by copying the value in the register Other embodiments for accomplishing the same result as 

holding the tail pointer to the register holding the gate the gated store buffer of FIG. 5 might include arrangements 

pointer. for transferring stores directly to memory while recording 

Thus, it may be seen that both the transfer of register state data sufficient to recover slate of the target computer in case 

from working registers to official target registers and the 35 the execution of a translation results in an exception or an 

transfer of working memory stores to official memory occur error necessitating rollback. In such a case, the effect of any 

together and only on boundaries between integral target memory stores which occurred during translation and execu- 

instructions in response to explicit commit operations. tion would have to be reversed and the memory state 

This allows the microprocessor to recover from target existing at the beginning of the translation restored; while 

exceptions which occur during execution by the enhanced 40 working registers would have to receive data held in the 

morph host without any significant delay. If a target excep- official target registers in the manner discussed above. One 

tion is generated during the running of any translated embodiment for accomplishing this maintains a separate 

instruction or instructions, that exception is detected by the target memory to hold the original memory state which is 

morph host hardware or software. In response to the detec- then utilized to replace overwritten memory if a rollback 

tion of the target exception, the code morphing software may 45 occurs. Another embodiment for accomplishing memory 

cause the values retained in the official registers to be placed rollback logs each store and the memory data replaced as 

back into the working registers and any non-committed they occur, and then reverses the store process if rollback is 

memory stores in the gated store buffer to be dumped (an required. 

operation referred to as "rollback"). The memory stores in The code morphing software provides an additional 

the gated store buffer of FIG. 5 may be dumped by copying 50 operation which greatly enhances the speed of processing 

the value in the register holding the gate pointer to the programs which are being translated. In addition to simply 

register holding the tail pointer. translating the instructions, optimizing, reordering, 

Placing the values from the target registers into the rescheduling, caching, and executing each translation so that 

working registers may place the address of the first of the it may be rerun whenever that set of instructions needs to be 

target instmctions which were running when the exception 55 executed, the translator also links the different translations to 

occurred in the working instruction pointer register. Begin- eliminate in almost all cases a return to the main loop of the 

ning with this official state of the target processor in the translation process. FIG. 8 illustrates the steps carried out by 

working registers, the target instructions which were rtm- the translator portion of the code morphing software in 

ning when the exception occurred are retranslated in serial accomplishing this linking process. It will be understood by 

order without any reordering or other optimizing. After each 60 those skilled in the art that this linking operation essentially 

target instruction is newly decoded and translated into a new eliminates the return to the main loop for most translations 

host translation, the translated host instruction representing of instructions, which eliminates this overhead, 

the target instructions is executed by the morph host and Presume for exemplary purposes that the target program 

causes or does not cause an exception to occur. (If the morph being run consists of X86 instructions. When a translation of 

host is other than a VLIW processor, then each of the 65 a sequence of target instructions occurs and the primitive 

primitive operations of the host translation is executed in host instructions are reordered and rescheduled, two primi- 

sequence. If no exception occurs as the host translation is tive instructions may occur at the end of each host transla- 
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tion. The first is a primitive instruction which updates the path through the translator main loop only occurs where a 
value of the instruction pointer for the target processor (or its link does not exist. Eventually, the main loop references in 
equivalent); this instruction is used to place the correct the branch instructions of host instructions are almost com- 
address of the next target instruction in the target instruction pletely eliminated. When this condition is reached, the time 
pointer register. Following this primitive instruction is a 5 required to fetch target instructions, decode target 
branch instruction which contains the address of each of two instructions, fetch the primitive instructions which make up 
possible targets for the branch. The manner in which the the target instructions, optimize those primitive operations, 
primitive instruction which precedes the branch instruction reorder the primitive operations, and reschedule those primi- 
may update the value of the instruction pointer for the target tive operations before running any host instruction is elimi- 
processor is to test the condition code for the branch in the nated. Thus, in contrast to aU prior art microprocessors 
condition code registers and then determine whether one of which must take each of these steps each time any applica- 
the two branch addresses indicated by the condition con- tion instruction sequence is run, the work required to run any 
trolling the branch is stored in the translation buffer. The first set of target instructions using the improved microprocessor 
time the sequence of target instructions is translated, the two after the first translation has taken place is drastically 
branch targets of the host instruction both hold the same host ^5 reduced. This work is further reduced as each set of trans- 
processor address for the main loop of the translator soft- lated host instructions is linked to the other sets of translated 
ware. host instructions. In fact, it is estimated that translation will 

When the host translation is completed, stored in the ^je needed in less than one translation execution out of one 
translation buffer, and executed for the first time, the instruc- million during the running of an application, 
tion pointer is updated in the target instruction pointer 20 Those skilled in the art wiU recognize that the implemen- 
register (as are the rest of the target registers); and the tation of the microprocessor requires a large translation 
operation branches back to the main loop. At the main loop, buffer since each set of instructions which is translated is 
the translator software looks up the instruction pointer to the cached in order that it need not be translated again. Trans- 
next target instruction in the target instruction pointer reg- lators designed to function with applications programmed 
ister. Then the next target instruction sequence is addressed. 25 ^^r different systems wiU vary in their need for supporting 
Presuming that this sequence of target instructions has not buffer memory. However, one embodiment of the micropro- 
yet been translated and therefore a translation does not cesser designed to run X86 programs utilizes a translation 
reside in the translation buffer, the next set of target instruc- buffer of two megabytes of random access memory, 
tions is fetched from memory, decoded, translated, Two additional hardware enhancements help to increase 
optimized, reordered, rescheduled, cached in the translation 30 the speed at which applications can be processed by the 
buffer, and executed. Since the second set of target instruc- microprocessor which includes the present invention. The 
tions follows the first set of target instructions, the primitive first of these is an abnormal/normal (A/N) protection bit 
branch instruction at the end of the host translation of the stored with each address translation in a translation look- 
first set of target instructions is automatically updated to aside buffer (TLB) (see FIG. 3) where lookup of the physical 
substitute the addressof the host translation of the second set 35 address of target instructions is first accomplished. Target 
of target instructions as the branch address for the particular memory operations within translations can be of two types, 
condition controlling the branch. ones which operate on memory (normal) or ones which 

If then, the second translated host instruction were to loop operate on a memory mapped I/O device (abnormal), 

back to the first translated host instruction, the branch A normal access which affects memory completes nor- 

opcration at the end of the second translation would include 40 mally. When instructions operate on memory, the optimizing 

the main loop address and the X86 address of the first and reordering of those instructions is appropriate and 

translation as the two possible targets for the branch. The greatly aids in speeding the operation of any system using 

update -instruction-pointer primitive operation preceding the the microprocessor which includes the present invention. On 

branch tests the condition and determines that the loop back the other hand, the operations of an abnormal access which 

to the first translation is to be taken and updates the target 45 affects an I/O device often must be practiced in the precise 

instruction pointer to the X86 address of the first translation. order in which those operations are programmed without the 

This causes the translator to look in the translation buffer to elimination of any steps or they may have some adverse 

see if the X86 address being sought appears there. The affect at the I/O device. For example, a particular 1/0 

address of the first translation is found, and its value in host operation may have the effect of clearing an I/O register; if 

memory space is substituted for the X86 address in the 50 the primitive operations take place out of order, then the 

branch at the end of the second host translated instruction. result of the operations may be different than the operation 

Then, the second host translated instruction is cached and commanded by the target instruction. Without a means to 

executed. This causes the loop to be run until the condition distinguish memory from memory mapped I/O, it is neces- 

causing the branch from the first translation to the second sary to treat all memory with the conservative assimiptions 

translation fails, and the branch takes the path back to the 55 used to translate instruction which affect memory mapped 

main loop. When this happens, the first translated host I/O. This severely restricts the nature of optimizations that 

instruction branches back to the main loop where the next set are achievable. Because prior art emulators lacked means to 

of target instructions designated by the target instruction both detect a failure of speculation on the nature of the 

pointer is searched for in the translation buffer, the host memory being addressed, and means to recover from such 

translation is fetched from the cache; or the search in the 60 failures, their performance was restricted, 

translation buffer fails, and the target instructions are fetched In one embodiment of the microprocessor illustrated in 

from niemory and translated. When this translated host FIG. 11, the A/N bit is a bit which may be set in the 

instruction is cached in the translation buffer, its address translation look-aside buffer to indicate either a memory 

replaces the main loop address in the branch instruction page or memory-mapped I/O. The translation look-aside 

which ended the loop. 65 buffer stores page table entries for memory accesses. Each 

In this manner, the various translated host instructions are such entry includes a virtual address being accessed and the 

chained to one another so that the need to follow the long physical address at which the data sought may be accessed 
as well as other information regarding the entry. In the present
invention, the A/N bit is part of that other information and indicates
whether the physical address is a memory address or a memory-mapped I/O
address. A translation of an operation which affects memory as though it
were a memory operation is actually a speculation that the operation is one
affecting memory. In one embodiment, when the code morphing software first
attempts to execute a translation which requires an access of either memory
or a memory-mapped I/O device, it is actually presuming that the access is
a memory access. In a different embodiment, the software might presume the
target command requires an I/O access. Presuming an access of that address
has not previously been accomplished, there will be no entry in the
translation look-aside buffer; and the access will fail in the translation
look-aside buffer. This failure causes the software to do a page table
lookup and fill a storage location of the translation look-aside buffer
with the page table entry to provide the correct physical address
translation for the virtual address. In accomplishing this, the software
causes the A/N bit for the physical address to be entered in the
translation look-aside buffer. Then another attempt to execute the access
takes place once more assuming that the access is of a memory address. As
the access is attempted, the target memory reference is checked by
comparing the access type presumed (normal or abnormal) against the A/N
protection bit now in the TLB page table entry. When the access type does
not match the A/N protection, an exception occurs. If the operation in fact
affects memory, then the optimizing, reordering, and rescheduling
techniques described above were correctly applied during translation. If
the comparison with the A/N bit in the TLB shows that the operation,
however, affects an I/O device, then execution causes an exception to be
taken; and the translator produces a new translation one target instruction
at a time without optimizing, reordering, or rescheduling of any sort.
Similarly, if a translation incorrectly assumes an I/O operation for an
operation which actually affects memory, execution causes an exception to
be taken; and the target instructions are retranslated using the
optimizing, reordering, and rescheduling techniques. In this manner, the
processor can enhance performance beyond what has been traditionally
possible.

It will be recognized by those skilled in the art that the technique which
uses the A/N bit to determine whether a failure of speculation has occurred
as to whether an access is to memory or a memory-mapped I/O device may also
be used for speculations regarding other properties of memory-mapped
addresses. For example, different types of memory might be distinguished
using such a normal/abnormal bit. Other similar uses in distinguishing
memory properties will be found by those skilled in the art.

One of the most frequent speculations practiced by the improved
microprocessor is that target exceptions will not occur within a
translation. This allows significant optimization over the prior art.
First, target state does not have to be updated on each target instruction
boundary, but only on target instruction boundaries which occur on
translation boundaries. This eliminates instructions necessary to save
target state on each target instruction boundary. Optimizations that would
previously have been impossible in scheduling and removing redundant
operations are also made possible.

The improved microprocessor is admirably adapted to select the appropriate
process of translation. In accordance with the method of translating
described above, a set of instructions may first be translated as though it
were to affect memory. When the optimized, reordered, and rescheduled host
instructions are then executed, the address may be found to refer to an I/O
device by the condition of the A/N bit provided in the translation
look-aside buffer. The comparison of the A/N bit and the translated
instruction address which shows that an operation is an I/O operation
generates an error exception which causes a software initiated rollback
procedure to occur, causing any uncommitted memory stores to be dumped and
the values in the target registers to be placed back into the working
registers. Then the translation starts over, one target instruction at a
time without optimization, reordering, or rescheduling. This re-translation
is the appropriate host translation for an I/O device.

In a similar manner, it is possible for a memory operation to be
incorrectly translated as an I/O operation. The error generated may be used
to cause its correct re-translation where it may be optimized, reordered,
and rescheduled to provide faster operation.

Prior art emulators have also struggled with what is generally referred to
as self modifying code. Should a target program write to the memory that
contains target instructions, this will cause translations that exist for
these target instructions to become "stale" and no longer valid. It is
necessary to detect these stores as they occur dynamically. In the prior
art, such detection has to be accomplished with extra instructions for each
store. This problem is larger in scope than programs modifying themselves.
Any agent which can write to memory, such as a second processor or a DMA
device, can also cause this problem.

The present invention deals with this problem by another enhancement to the
morph host. A translation bit (T bit) which may also be stored in the
translation look-aside buffer is used to indicate target memory pages for
which translations exist. The T bit thus possibly indicates that particular
pages of target memory contain target instructions for which host
translations exist which would become stale if those target instructions
were to be overwritten. If an attempt is made to write to the protected
pages in memory, the presence of the translation bit will cause an
exception which when handled by the code morphing software can cause the
appropriate translation(s) to be invalidated or removed from the
translation buffer. The T bit can also be used to mark other target pages
that translation may rely upon not being written.

This may be understood by referring to FIG. 3 which illustrates in block
diagram form the general functional elements of the microprocessor which
includes the invention. When the morph host executes a target program, it
actually runs the translator portion of the code morphing software which
includes the only original untranslated host instructions which effectively
run on the morph host. To the right in the figure is illustrated memory
divided into a host portion including essentially the translator and the
translation buffer and a target portion including the target instructions
and data, including the target operating system. The morph host hardware
begins executing the translator by fetching host instructions from memory
and placing those instructions in an instruction cache. The translator
instructions generate a fetch of the first target instructions stored in
the target portion of memory. Carrying out a target fetch causes the
integer unit to look to the official target instruction pointer register
for a first address of a target instruction. The first address is then
accessed in the translation look-aside buffer of the memory management
unit. The memory management unit includes hardware for paging and provides
memory mapping facilities for the TLB.

Presuming that the TLB is correctly mapped so that it holds lookup data for
the correct page of target memory, the
target instruction pointer value is translated to the physical address of
the target instruction. At this point, the condition of the bit (T bit)
indicating whether a translation has been accomplished for the target
instruction is detected; but the access is a read operation, and no T bit
exception will occur. The condition of the A/N bit indicating whether the
access is to memory or memory mapped I/O is also detected. Presuming the
last mentioned bit indicates a memory location, the target instruction is
accessed in target memory since no translation exists. The target
instruction and subsequent target instructions are transferred as data to
the morph host computing units and translated under control of the
translator instructions stored in the instruction cache. The translator
instructions utilize reordering, optimizing, and rescheduling techniques as
though the target instruction affected memory. The resulting translation
containing a sequence of host instructions is then stored in the
translation buffer in host memory. The translation is transferred directly
to the translation buffer in host memory via the gated store buffer. Once
the translation has been stored in host memory, the translator branches to
the translation which then executes. The execution (and subsequent
executions) will determine if the translation has made correct assumptions
concerning exceptions and memory. Prior to executing the translation, the T
bit for the target page(s) containing the target instructions that have
been translated is set. This indication warns that the instruction has been
translated; and, if an attempt to write to the target address occurs, the
attempt generates an exception which causes the translation to possibly be
invalidated or removed.

If a write is attempted to target pages marked by a T bit, an exception
occurs and the write is aborted. The write will be allowed to continue
after the response to the exception assures that translations associated
with the target memory address to be written are either marked as invalid
or otherwise protected against use until they have been appropriately
updated. Some write operations will actually require nothing to be done
since no valid translations will be affected. Other write operations will
require that one or more translations associated with the addressed target
memory be appropriately marked or removed. FIG. 11 illustrates one
embodiment of a translation look-aside buffer including storage positions
with each entry for holding a T bit indication.

An additional hardware enhancement to the morph host is a circuit utilized
to allow data which is normally stored in memory but is used quite often in
the execution of an operation to be replicated (or "aliased") in an
execution unit register in order to eliminate the time required to fetch
the data from or store the data to memory. For example, if data in memory
is reused frequently during the execution of a code sequence, the data must
typically be retrieved from memory and loaded to a register in an execution
unit each time the data is used. To reduce the time required by such
frequent memory accesses, the data may according to the present invention
instead be loaded once from memory to an execution unit register at the
beginning of the code sequence and the register designated to function in
place of the memory space during the period in which the code sequence
continues. Once this has been accomplished, each of the load operations
which would normally involve loading data to a register from the designated
memory address becomes instead a simple register-to-register copy operation
which proceeds at a much faster pace; and even those copy operations may
frequently be eliminated by further optimization.

Similarly, execution of a code sequence often requires that data be written
to a memory address frequently during the execution of a code sequence. To
reduce the time required by such frequent memory stores to the same
address, each time the data is to be written to the memory address,
according to the present invention, it may be transferred to an execution
unit register which is designated to function in place of the memory space
during the period in which the code sequence is continuing. Once an
execution unit register has been designated, each change to the data
requires only a simple register-to-register transfer operation which
proceeds much faster than storing to a memory address.

The present invention provides an arrangement to accomplish these aliasing
operations. In one embodiment illustrated in FIG. 10, the morph host is
designed to respond to a "load and protect" command with respect to a
designated memory address which is to be used frequently in a code
sequence. The morph host allocates a working register 111 in an execution
unit 110 to hold the memory data and stores the memory address in a special
register 112 of the memory control unit. The working register 111 may be
one of a number of registers (e.g., eight of the working registers
illustrated in FIG. 4) in an execution unit which may be allocated for such
a purpose.

When the invention is used to eliminate loads from a memory address to the
execution unit, the data at the memory address is first loaded to the
register 111 and the memory address placed in the register 112. Thereafter,
the code sequence is executed at an accelerated rate using the data in the
register 111. During this period, each operation which would normally
require a load from the memory address held in register 112 is accomplished
instead by copying the data from the register 111. This continues until the
code sequence is complete (or terminates in some other manner) and
protection of the memory space is removed.

Similarly, in order to accelerate a code sequence which constantly stores
data from an execution unit 110 to the same memory address, a similar
aliasing process may be practiced. A "load and protect" command causes the
memory address to be placed in the register 112 and the data which would
normally be stored at that memory address to be transferred instead to the
working register 111. For example, in a computation in which a loop
execution would normally be storing a series of values to the same memory
address, by allocating a register 111 to hold the data and holding the
memory address in a register 112, the process of storing becomes a
register-to-register transfer within the execution unit. This operation
also continues until the code sequence is complete (or terminates in some
other manner), the memory space is updated, and the protection of the
memory space is removed.

Although each of these aliasing techniques greatly enhances the speed of
execution of some code sequences, these operations by which memory accesses
are eliminated give rise to a significant number of problems. This is
especially true where a substantial portion of the host processor
operations relate to translation of instructions between a target
instruction set and the host instruction set. All of these problems are
related to the necessity to assure that data which is to be used in the
execution of an instruction is valid at the time it is to be used.

There are a number of instances in which data stored at a memory address
and data stored in an execution unit register may differ so that one or the
other is invalid at any particular instant. For example, if a working
register 111 is being used to hold data which would normally be loaded
frequently from the memory space to registers during a code sequence, an
instruction may write to the memory address before the code sequence using
the data in the execution unit register completes. In such a case, the data
in the execution unit register being utilized by the code sequence will be
stale and must be updated.

As another example, if a working register is being used to 
hold data which would normally be stored frequently to a 
memory address during a code sequence, an instruction may 
attempt to write to the memory address before the code 
sequence using the execution unit register in place of 
memory completes. If the host processor is functioning in a 
mode in which data at the memory address is normally 
updated only at the end of the code sequence (a write-back 
mode), the data in the execution unit register will be stale 
and must be updated from data written to the memory 
address. Of course, if the host processor is functioning in a 
mode in which data at the memory address is normally 
updated each time it is written to the execution unit register 
(a write-through mode), then the register and memory will 
be consistent. 

As yet another example, if a working register is being 
used to hold data which would normally be stored frequently 
during a code sequence to a memory address, an instruction 
may attempt to read data from the memory address before 
the code sequence transferring data to the register 111 
completes. If the host processor is functioning in a mode in 
which data at the memory address is normally updated only 
at the end of the code sequence (a write-back mode), the data 
in memory will be stale and must be updated by data from 
the execution unit register before the read is allowed. As 
with the example above, if the host processor is functioning 
in a mode in which data at the memory address is normally 
updated each time it is written to the execution unit register 
(a write-through mode), then the register and memory will 
be consistent. 
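
The last of these cases, in which stores are aliased and memory is read before the code sequence completes, can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `AliasedStore`, its methods, and the dictionary standing in for memory are hypothetical names chosen for illustration and do not appear in the description. Only the write-back and write-through behaviour follows the text.

```python
from dataclasses import dataclass, field


@dataclass
class AliasedStore:
    """Toy model (hypothetical): one memory address aliased by one register."""
    write_through: bool                          # True: write-through, False: write-back
    memory: dict = field(default_factory=dict)   # stands in for target memory
    address: int = 0                             # plays the role of register 112
    register: int = 0                            # plays the role of working register 111

    def load_and_protect(self, address: int) -> None:
        # Record the protected address and copy its current data to the register.
        self.address = address
        self.register = self.memory.get(address, 0)

    def store(self, value: int) -> None:
        # A store by the code sequence becomes a register transfer.
        self.register = value
        if self.write_through:
            # Write-through: memory is updated on every aliased store.
            self.memory[self.address] = value

    def memory_is_stale(self) -> bool:
        # True when a read of memory would not see the latest stored value.
        return self.memory.get(self.address, 0) != self.register

    def release(self) -> None:
        # End of the code sequence: memory is updated from the register.
        self.memory[self.address] = self.register


write_back = AliasedStore(write_through=False, memory={0x100: 1})
write_back.load_and_protect(0x100)
write_back.store(5)

write_through = AliasedStore(write_through=True, memory={0x100: 1})
write_through.load_and_protect(0x100)
write_through.store(5)
```

In this model `write_back.memory_is_stale()` is true until `release()` is called, while `write_through.memory_is_stale()` is false immediately after the store, which mirrors the distinction drawn between the two modes.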

Another possibility by which data held in memory and in 
aliasing registers may become inconsistent exists because 
the microprocessor formed by the combination of the morph 
host and the code morphing software is adapted to reorder 
and reschedule host instructions to accelerate execution. As 
will be seen in the various examples of code sequences 
provided below, once memory data has been aliased in an 
execution unit register to be used in the execution of a code 
sequence, the data in the execution unit register may be 
copied to other registers and a process of reordering and 
rescheduling instructions may then occur. If reordering and 
rescheduling has occurred, it is possible for an instruction in 
the code sequence to write to the memory address which is 
being aliased so that the data in the execution unit register 
must be updated before further use. However, if the now- 
stale data in the execution unit register 111 has already been 
copied to additional registers and the code sequence of 
instructions using those registers has been altered, then stale 
data in registers to which the data has been copied may be 
utilized in carrying out the code sequence. Thus, a second 
order inconsistency may occur. 

To make sure that loads from and stores to the memory 
address which is being protected do not take place without 
verifying that the data at the memory address and in the 
register 111 are consistent after the load or store operation, 
a comparator 113 in the memory control unit is associated 
with the address register 112. The comparator 113 receives 
the addresses of loads from memory and stores to the 
gated store buffer directed to memory during translations. If 
a memory address for either a load or a store compares with 
an address in the register 112 (or additional registers depend- 
ing on the implementation), an exception may be generated 
depending on the mode. The code morphing software 
responds to the exception by assuring that the memory 
address and the execution unit register 111 hold the same 
correct data. This allows the inconsistencies described above 
to be corrected. 
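
The comparison performed by the comparator 113 can be sketched in C as follows. This is a simplified model under stated assumptions: the struct `alias_comparator` and the function `alias_exception` are hypothetical names, only one protected address is held, and the mode rule (stores always checked, loads checked only in write-back mode) is taken from the description of the two modes elsewhere in this specification.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { ACCESS_LOAD, ACCESS_STORE } access_kind;
typedef enum { MODE_WRITE_THROUGH, MODE_WRITE_BACK } alias_mode;

/* Hypothetical model of address register 112 plus comparator 113. */
typedef struct {
    bool       enabled;
    alias_mode mode;
    uint32_t   protected_addr;  /* plays the role of address register 112 */
} alias_comparator;

/* Returns true when the access should raise an alias exception. */
bool alias_exception(const alias_comparator *c, access_kind kind, uint32_t addr)
{
    if (!c->enabled || addr != c->protected_addr)
        return false;                    /* no address match: nothing to do */
    if (kind == ACCESS_STORE)
        return true;                     /* a store can make the register stale */
    return c->mode == MODE_WRITE_BACK;   /* memory can be stale only in write-back */
}
```

A load from the protected address in write through mode passes unchecked because memory is already current, which is why the function returns false for that one combination.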

The manner in which the code morphing software 
responds depends on the particular exception. If the data are 
not the same, in one embodiment, the translation is rolled 
back and reexecuted without any "aliased" data in an 
execution unit register. Such a solution allows the correction 
of inconsistencies which occur both between memory and 
the execution unit register and between memory and other 
registers which have copied the data from the execution unit 
register 111 before the code sequence was reordered or 
rescheduled. Other possible methods of correcting the 
problem are to update the execution unit register with the latest 
memory data or memory with the latest load data. 

During the period in which a memory address is aliased 
to eliminate loads from that memory address, the compara- 
tor looks for attempts to write the memory address since the 
data in the execution unit register 111 may become stale 
when the new data is written to the memory address. In such 
a case, the comparator 113 detects the attempt to write to the 
protected memory address; and generates an exception if 
such an attempt occurs. The exception either causes the data 
in memory to be written to the register 111 to update the 
register before the register data may be used further, or 
causes a rollback and execution of code that does not use an 
execution unit register to accomplish alias optimization. 
This may involve re-translation of the target code. 

During the period in which a memory address is aliased 
to allow sequential store operations using a register 111 to 
represent the memory address, the generation of an excep- 
tion for a store to the memory address may be disabled by 
a command which places the circuitry in a mode (write 
through mode) in which stores to the memory address from 
the register 111 may occur without an alias check thereby 
allowing the repetitive storage to memory at the protected 
address from the register. 

Alternatively, during a period in which a memory address 
is aliased to allow store operations using a register 111 to 
represent the memory address, the circuitry may be placed 
in a mode (write back mode) in which the data at the 
memory location is not updated until the code sequence has 
been completed or otherwise terminated. In such a mode, a 
write by an instruction to the memory address may require 
that the data held in the execution unit register be updated to 
be consistent with the new data. On the other hand, in such 
a mode, an attempt to read the memory address will require 
that an exception be generated so that the data held in the 
memory space can be updated to be consistent with the new 
data in the execution unit register before it is read. 
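
The two repairs described for write back mode can be sketched as a small C handler. The names `alias_pair`, `alias_fault`, and `alias_repair` are hypothetical, and the sketch assumes the faulting access covers the whole 32-bit location; it is one possible reading of the text, not the handler of the code morphing software.

```c
#include <stdint.h>

typedef enum { FAULT_ON_STORE, FAULT_ON_LOAD } alias_fault;

/* Hypothetical pairing of a protected location with its aliasing register. */
typedef struct {
    uint32_t *mem;  /* protected memory location */
    uint32_t  reg;  /* execution unit register aliasing that location */
} alias_pair;

/* Repair for an alias fault taken in write back mode.
 * store_value is the data the faulting store writes; it is ignored for loads.
 * Returns the value held in memory once the repair is complete. */
uint32_t alias_repair(alias_pair *p, alias_fault fault, uint32_t store_value)
{
    if (fault == FAULT_ON_STORE) {
        *p->mem = store_value;  /* let the store reach memory */
        p->reg  = store_value;  /* refresh the now-stale register */
    } else {
        *p->mem = p->reg;       /* bring memory up to date before the read */
    }
    return *p->mem;
}
```

After either branch the register and the memory location hold the same value, which is the condition the exception exists to restore.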

FIG. 12 illustrates alias circuitry including one embodiment 
of a comparator 120 for detecting and controlling load 
and store operations to protected memory space in accordance 
with the present invention. The comparator 120 
includes a plurality of storage locations 122 (only one of 
which is illustrated) such as content addressable memory for 
entries of memory addresses which are to be protected. For 
example, there may be eight locations for entries. Each entry 
includes a sufficient number of bit positions (e.g., 32) to 
store a physical address for the memory location, a byte 
mask, and various attribute bits. Among the attribute bits are 
those indicating the size of the protected memory and 
whether the memory address is normal or abnormal. It 
should be noted that the locations for entries in the comparator 
120 are each equivalent to a register 112 shown in 
FIG. 10 so that the comparator 120 accomplishes the purpose 
of both register 112 and comparator 113 of FIG. 10. 

The alias circuitry also includes an alias enable register 
124, a register 125 for shadowing the alias enable register, 
an alias fault register 126, a register 127 storing an indication 
(e.g., a single bit) that the alias circuitry is enabled, and 
a register 128 storing a mode bit. 

In operation, a physical address to be protected is stored 
in one of the locations for entries together with a byte mask 
the bits of which indicate which bytes of the location are 
protected. Such a physical address may address 64 bits of 
data so that each bit of the byte mask indicates one byte of 
the data at the address. The particular entry which is protected 
is indicated by setting a particular bit of the hardware 
enable register 124. The register 125 shadows the values in 
the register 124 at commit points during translation to allow 
rollbacks to occur during translation. In the embodiment 
shown, the enable register 124 and the shadow enable register 
are physically distributed as attribute bits of the storage 
locations 122. 

When aliasing is enabled as indicated by the register 127, 
depending on the condition in which the mode is set as 
indicated by the register 128, the comparator holds a physical 
memory address and byte mask and uses those to test 
addresses of stores to memory or both loads and stores. If the 
mode is set to a write through condition, then memory is 
continually updated from the execution unit register holding 
data for the protected memory address so that loads from 
that memory address to other addresses are always up to date 
and need not be checked. However, stores to the memory 
address may invalidate the data in the execution unit register 
111 so these stores must be tested. If a store is to a protected 
address and its byte mask shows that data is being stored to 
a protected byte at the memory address held in the comparator 
120, then the comparator generates an alias exception 
in order to test stores in the write through mode. 

On the other hand, if the mode is set to a write back 
condition, then the memory address is only updated when 
the alias hardware is released or when exceptions occur. 
Consequently, the data at the memory address may be stale 
so both loads and stores must be tested when the alias 
hardware is enabled. To accomplish this, if either a load or 
a store is to a protected address and its byte mask shows that 
data is being accessed at a protected byte at the memory 
address held in the comparator 120, then the comparator 
generates an alias exception. 

An exception caused in either mode sets an appropriate bit 
in the alias fault register 126 to designate the address causing 
the exception. Depending on the particular exception handler 
of the code morphing software, the particular exception 
generated may repair or rollback to correct the problem. A 
repair of the problem causes the most up-to-date data to be 
placed in the particular bytes affected of the execution unit 
data register and the memory address. A rollback causes the 
state of the registers to be replaced by the state held in the 
target registers; this includes the state of the enable register 
124 which is rolled back to the state held in the register 125. 

The use of alias detection hardware to allow optimizations 
that eliminate loads and stores and also to allow the 
re-ordering or re-scheduling of operations dependent upon 
the eliminated loads and stores has been described. The 
re-ordering enables better scheduling of operations in a 
machine with parallel execution resources, such as a superscalar 
or VLIW machine. 

The method can also be used to allow the safe re-ordering 
of operations dependent upon loads or stores, without eliminating 
the load or store operations. This improves scheduling 
performance and is useful for code where there is no 
repetition of load or store operations. 

It will be recognized by those skilled in the art that the 
microprocessor may be connected in circuit with typical 
computer elements to form a computer such as that illustrated 
in FIG. 9. As may be seen, when used in a modern 
X86 computer the microprocessor is joined by a processor 
bus to memory and bus control circuitry. The memory and 
bus control circuitry is arranged to provide access to main 
memory as well as to cache memory which may be utilized 
with the microprocessor. The memory and bus control 
circuitry also provides access to a bus such as a PCI or other 
local bus through which I/O devices may be accessed. The 
particular computer system will depend upon the circuitry 
utilized with a typical microprocessor which the present 
microprocessor replaces. 

In order to illustrate the operation of the processor and the 
manner in which acceleration of execution occurs, the 
translation of a small sample of X86 target code to host 
primitive instructions is presented at this point. The sample 
illustrates the translation of X86 target instructions to morph 
host instructions including various exemplary steps of 
optimizing, reordering, and rescheduling by the microprocessor 
which includes the invention. By following the process 
illustrated, the substantial difference between the operations 
required to execute the original instructions using the 
target processor and the operations required to execute the 
translation on the host processor will become apparent to 
those skilled in the art. 

The original instruction illustrated in C language source 
code describes a very brief loop operation. Essentially, while 
some variable "n" which is being decremented after each 
loop remains greater than "0", a value "c" is stored at an 
address indicated by a pointer "*s" which is being incremented 
after each loop. 

Original C code 
    while( (n--)>0 ) { 
        *s++ = c; 
    } 

Win32 x86 instructions produced by a compiler compiling this C code. 
    mov   %ecx, [%ebp+0xc]     // load c from memory address into the %ecx 
    mov   %eax, [%ebp+0x8]     // load s from memory address into the %eax 
    mov   [%eax], %ecx         // store c into memory address s held in %eax 
    add   %eax, #4             // increment s by 4 
    mov   [%ebp+0x8], %eax     // store (s + 4) back into memory 
    mov   %eax, [%ebp+0x10]    // load n from memory address into the %eax 
    lea   %ecx, [%eax-1]       // decrement n and store the result in %ecx 
    mov   [%ebp+0x10], %ecx    // store (n-1) into memory 
    and   %eax, %eax           // test n to set the condition codes 
    jg    .-0x1b               // branch to the top of this section if "n>0" 

Notation: [...] indicates an address expression for a memory operand. In the 
example above, the address for a memory operand is formed from the 
contents of a register added to a hexadecimal constant indicated by the 0x 
prefix. Target registers are indicated with the % prefix, e.g. %ecx is the ecx 
register. The destination of an operation is to the left. 
Target instruction key: 
    jg  = jump if greater 
    mov = move 
    lea = load effective address 
    and = AND 

In this first portion of the sample, each of the individual 
X86 assembly language instructions for carrying out the 
execution of the operation defined by the C language statement 
is listed by the assembly language mnemonic for the 
operation followed by the parameters involved in the particular 
primitive operation. An explanation of the operation 
is also provided in a comment for each instruction. Even 
though the order of execution may be varied by the target 
processor from that shown, each of these assembly language 
instructions must be executed each time the loop is executed 
in carrying out the target C language instructions. Thus, if 
the loop is executed one hundred times, each instruction 
shown above must be carried out one hundred times. 
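
For reference, the loop can be wrapped in a compilable C function. The function name `fill_words` and its signature are hypothetical; the parameter order is chosen only so that, under a conventional 32-bit stack frame, `s`, `c`, and `n` would sit at the offsets 0x8, 0xc, and 0x10 from the frame pointer used in the listing.

```c
#include <stdint.h>

/* Hypothetical wrapper around the sample loop; the specification
 * shows only the loop itself.  Stores c into n consecutive 32-bit
 * words starting at s. */
void fill_words(int32_t *s, int32_t c, int32_t n)
{
    while ((n--) > 0) {
        *s++ = c;  /* store c, then advance s by one 4-byte element */
    }
}
```

Because each pass both stores through `s` and updates `s` and `n`, a compiler that keeps `s` and `n` in memory emits the repeated loads and stores to the same stack addresses that appear in the listing.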



The next sample illustrates the same target primitive 
instructions which carry out the C language instructions. 
However, following each primitive target instruction are 
listed primitive host instructions required to accomplish the 
same operation in one particular embodiment of the microprocessor 
in which the morph host is a VLIW processor 
designed in the manner described herein. It should be noted 
that the host registers which are shadowed by official target 
registers are designated by an "R" followed by the X86 
register designation so that, for example, Reax is the working 
register associated with the EAX official target register. 

Shows each X86 Instruction shown above followed by the host 
instructions necessary to implement the X86 Instruction. 
    mov   %ecx, [%ebp+0xc]     // load c from memory address into %ecx 
    add   R0,Rebp,0xc          ; form the memory address and put it in R0 
    ld    Recx, [R0]           ; load c from memory address in R0 into Recx 
    mov   %eax, [%ebp+0x8]     // load s from memory address into %eax 
    add   R2,Rebp,0x8          ; form the memory address and put it in R2 
    ld    Reax, [R2]           ; load s from memory address in R2 into Reax 
    mov   [%eax], %ecx         // store c into memory address s held in %eax 
    st    [Reax],Recx          ; store c into memory address s held in Reax 
    add   %eax, #4             // increment s by 4 
    add   Reax,Reax,4          ; increment s by 4 
    mov   [%ebp+0x8], %eax     // store (s + 4) back into memory 
    add   R5,Rebp,0x8          ; form the memory address and put it in R5 
    st    [R5],Reax            ; store (s + 4) back into memory 
    mov   %eax, [%ebp+0x10]    // load n from memory address into %eax 
    add   R7,Rebp,0x10         ; form the memory address and put it in R7 
    ld    Reax, [R7]           ; load n from memory address into the Reax 
    lea   %ecx, [%eax-1]       // decrement n and store the result in %ecx 
    sub   Recx,Reax,1          ; decrement n and store the result in Recx 
    mov   [%ebp+0x10], %ecx    // store (n - 1) into memory 
    add   R9,Rebp,0x10         ; form the memory address and put it in R9 
    st    [R9],Recx            ; store (n - 1) into memory 
    and   %eax, %eax           // test n to set the condition codes 
    andcc R11,Reax,Reax        ; test n to set the condition codes 
    jg    .-0x1b               // branch to the top of this section if "n>0" 
    jg    mainloop,mainloop    ; jump to the main loop 
Host Instruction key: 
    ld = load    add = ADD    st = store 
    sub = subtract    jg = jump if condition codes indicate greater 
    andcc = and set the condition codes 

The next sample illustrates for each of the primitive target 
instructions the addition of host primitive instructions by 
which addresses needed for the target operation may be 
generated by the code morphing software. It should be noted 
that host address generation instructions are only required in 
an embodiment of a microprocessor in which code morphing 
software is used for address generation rather than address 
generation hardware. In a target processor such as an X86 
microprocessor these addresses are generated using address 
generation hardware. Whenever address generation occurs 
in such an embodiment, the calculation is accomplished; and 
host primitive instructions are also added to check the 
address values to determine that the calculated addresses are 
within the appropriate X86 segment limits. 

Adds host instructions necessary to perform X86 address computation 
and upper and lower segment limit checks. 
    mov   %ecx, [%ebp+0xc]     // load c 
    add   R0,Rebp,0xc          ; form logical address into R0 
    chkl  R0,Rss_limit         ; Check the logical address against segment lower limit 
    chku  R0,R_FFFFFFFF        ; Check the logical address against segment upper limit 
    add   R1,R0,Rss_base       ; add the segment base to form the linear address 
    ld    Recx, [R1]           ; load c from memory address in R1 into Recx 
    mov   %eax, [%ebp+0x8]     // load s 
    add   R2,Rebp,0x8          ; form logical address into R2 
    chkl  R2,Rss_limit         ; Check the logical address against segment lower limit 
    chku  R2,R_FFFFFFFF        ; Check the logical address against segment upper limit 
    add   R3,R2,Rss_base       ; add the segment base to form the linear address 
    ld    Reax, [R3]           ; load s from memory address in R3 into Reax 
    mov   [%eax], %ecx         // store c into [s] 
    chku  Reax,Rds_limit       ; Check the logical address against segment upper limit 
    add   R4,Reax,Rds_base     ; add the segment base to form the linear address 
    st    [R4],Recx            ; store c into memory address s 
    add   %eax, #4             // increment s by 4 
    addcc Reax,Reax,4          ; increment s by 4 
    mov   [%ebp+0x8], %eax     // store (s + 4) to memory 
    add   R5,Rebp,0x8          ; form logical address into R5 
    chkl  R5,Rss_limit         ; Check the logical address against segment lower limit 
    chku  R5,R_FFFFFFFF        ; Check the logical address against segment upper limit 
    add   R6,R5,Rss_base       ; add the segment base to form the linear address 
    st    [R6],Reax            ; store (s + 4) to memory address in R6 
    mov   %eax, [%ebp+0x10]    // load n 
    add   R7,Rebp,0x10         ; form logical address into R7 
    chkl  R7,Rss_limit         ; Check the logical address against segment lower limit 
    chku  R7,R_FFFFFFFF        ; Check the logical address against segment upper limit 
    add   R8,R7,Rss_base       ; add the segment base to form the linear address 
    ld    Reax, [R8]           ; load n from memory address in R8 into Reax 
    lea   %ecx, [%eax-1]       // decrement n 
    sub   Recx,Reax,1          ; decrement n 
    mov   [%ebp+0x10], %ecx    // store (n - 1) 
    add   R9,Rebp,0x10         ; form logical address into R9 
    chkl  R9,Rss_limit         ; Check the logical address against segment lower limit 
    chku  R9,R_FFFFFFFF        ; Check the logical address against segment upper limit 
    add   R10,R9,Rss_base      ; add the segment base to form the linear address 
    st    [R10],Recx           ; store n-1 in Recx into memory using address in R10 
    and   %eax, %eax           // test n to set the condition codes 
    andcc R11,Reax,Reax        ; test n to set the condition codes 
    jg    .-0x1b               // branch to the top of this section if "n>0" 
    jg    mainloop,mainloop    ; jump to the main loop 
Host Instruction key: 
    chkl = check lower limit 
    chku = check upper limit 
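
The address generation and limit checking performed by these added host instructions can be sketched in C. The struct `segment` and the function `linear_address` are hypothetical, and the exact fault behavior of the chkl and chku instructions is assumed here to be a simple range test; the specification names the instructions only as lower and upper limit checks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one segment. */
typedef struct {
    uint32_t base;   /* segment base, e.g. Rss_base */
    uint32_t lower;  /* lowest valid logical address (chkl operand) */
    uint32_t upper;  /* highest valid logical address (chku operand) */
} segment;

/* Forms the linear address for [reg + disp].
 * Returns false, leaving *linear untouched, on a limit violation. */
bool linear_address(const segment *seg, uint32_t reg, uint32_t disp,
                    uint32_t *linear)
{
    uint32_t logical = reg + disp;   /* add  R0,Rebp,0xc    */
    if (logical < seg->lower)        /* chkl R0,Rss_limit   */
        return false;
    if (logical > seg->upper)        /* chku R0,R_FFFFFFFF  */
        return false;
    *linear = logical + seg->base;   /* add  R1,R0,Rss_base */
    return true;
}
```

Each memory operand therefore costs four host operations before the load or store itself, which is the overhead the later flat-address-space optimizations remove.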
Adds instructions to maintain the target X86 instruction pointer "eip" and 
the commit instructions that use the special morph host hardware to 
update X86 state. 
    mov   %ecx, [%ebp+0xc]     // load c 
    add   R0,Rebp,0xc 
    chkl  R0,Rss_limit 
    chku  R0,R_FFFFFFFF 
    add   R1,R0,Rss_base 
    ld    Recx, [R1] 
    add   Reip,Reip,3          ; add X86 instruction length to eip in Reip 
    commit                     ; commits working state to official state 
    mov   %eax, [%ebp+0x8]     // load s 
    add   R2,Rebp,0x8 
    chkl  R2,Rss_limit 
    chku  R2,R_FFFFFFFF 
    add   R3,R2,Rss_base 
    ld    Reax, [R3] 
    add   Reip,Reip,3          ; add X86 instruction length to eip in Reip 
    commit                     ; commits working state to official state 
    mov   [%eax], %ecx         // store c into [s] 
    chku  Reax,Rds_limit 
    add   R4,Reax,Rds_base 
    st    [R4],Recx 
    add   Reip,Reip,2          ; add X86 instruction length to eip in Reip 
    commit                     ; commits working state to official state 
    add   %eax, #4             // increment s by 4 
    addcc Reax,Reax,4 
    add   Reip,Reip,5          ; add X86 instruction length to eip in Reip 
    commit                     ; commits working state to official state 
    mov   [%ebp+0x8], %eax     // store (s + 4) 
    add   R5,Rebp,0x8 
    chkl  R5,Rss_limit 
    chku  R5,R_FFFFFFFF 
    add   R6,R5,Rss_base 
    st    [R6],Reax 
    add   Reip,Reip,3          ; add X86 instruction length to eip in Reip 
    commit                     ; commits working state to official state 
    mov   %eax, [%ebp+0x10]    // load n 
    add   R7,Rebp,0x10 
    chkl  R7,Rss_limit 
    chku  R7,R_FFFFFFFF 
    add   R8,R7,Rss_base 
    ld    Reax, [R8] 
    add   Reip,Reip,3          ; add X86 instruction length to eip in Reip 
    commit                     ; commits working state to official state 
    lea   %ecx, [%eax-1]       // decrement n 
    sub   Recx,Reax,1 
    add   Reip,Reip,3          ; add X86 instruction length to eip in Reip 
    commit                     ; commits working state to official state 
    mov   [%ebp+0x10], %ecx    // store (n - 1) 
    add   R9,Rebp,0x10 
    chkl  R9,Rss_limit 
    chku  R9,R_FFFFFFFF 
    add   R10,R9,Rss_base 
    st    [R10],Recx 
    add   Reip,Reip,3          ; add X86 instruction length to eip in Reip 
    commit                     ; commits working state to official state 
    and   %eax, %eax           // test n 
    andcc R11,Reax,Reax 
    add   Reip,Reip,3 
    commit                     ; commits working state to official state 
    jg    .-0x1b               // branch "n>0" 
    add   Rseq,Reip,Length(jg) 
    ldc   Rtarg,EIP(target) 
    selcc Reip,Rseq,Rtarg 
    commit                     ; commits working state to official state 
    jg    mainloop,mainloop 
Host Instruction key: 
    commit = copy the contents of the working registers to the 
    official target registers and send working stores to memory 



This sample illustrates the addition of two steps to each 
set of primitive host instructions to update the official target 
registers after the execution of the host instructions necessary 
to carry out each primitive target instruction and to 
commit the uncommitted values in the gated store buffer to 
memory. As may be seen, in each case, the length of the 
target instruction is added to the value in the working 
instruction pointer register (Reip). Then a commit instruction 
is executed. In one embodiment, the commit instruction 
copies the current value of each working register which is 
shadowed into its associated official target register and 
moves a pointer value designating the position of the gate of 
the gated store buffer from immediately in front of the 
uncommitted stores to immediately behind those stores so 
that they will be placed in memory. 
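
A minimal C sketch of the commit step, together with the rollback it makes possible, is given below. The struct `morph_state`, the functions `buffer_store`, `commit`, and `rollback`, and the array sizes are hypothetical; draining committed entries from the buffer to memory is omitted.

```c
#include <stdint.h>
#include <string.h>

enum { NUM_REGS = 8, BUF_SLOTS = 16 };  /* illustrative sizes only */

typedef struct { uint32_t addr; uint32_t data; } store_entry;

typedef struct {
    uint32_t    working[NUM_REGS];   /* working registers (Reax, Recx, ...) */
    uint32_t    official[NUM_REGS];  /* official target registers */
    store_entry buf[BUF_SLOTS];      /* gated store buffer */
    int         gate;                /* entries below the gate are committed */
    int         tail;                /* next free slot */
} morph_state;

/* Queue a store; it stays uncommitted until the next commit.
 * Returns 0 on success, -1 if the buffer is full. */
int buffer_store(morph_state *m, uint32_t addr, uint32_t data)
{
    if (m->tail >= BUF_SLOTS)
        return -1;
    m->buf[m->tail].addr = addr;
    m->buf[m->tail].data = data;
    m->tail++;
    return 0;
}

/* Copy working state to official state and move the gate past
 * the stores queued so far. */
void commit(morph_state *m)
{
    memcpy(m->official, m->working, sizeof m->official);
    m->gate = m->tail;
}

/* Restore working state from official state and discard stores
 * that never passed the gate. */
void rollback(morph_state *m)
{
    memcpy(m->working, m->official, sizeof m->working);
    m->tail = m->gate;
}
```

Keeping the gate as a single index makes both operations cheap: commit and rollback each cost one register copy and one pointer update regardless of how many stores are pending.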

It will be appreciated that the list of instructions illustrated 
last above are all of the instructions necessary to form a host 
translation of the original target assembly language instruc- 
tions. If the translation were to stop at this point, the number 
of primitive host instructions would be much larger than the 
number of target instructions (probably six times as many 
instructions), and the execution could take longer than 
execution on a target processor. However, at this point, no 
reordering, optimizing, or rescheduling has yet taken place. 

If an instruction is to be run but once, it may be that the 
time required to accomplish further reordering and other 
optimization is greater than the time to execute the translation 
as it exists at this point. If so, one embodiment of the 
microprocessor ceases the translation at this point, stores the 
translation, then executes it to determine whether exceptions 
or errors occur. In this embodiment, steps of reordering and 
other optimization only occur if it is determined that the 
particular translation will be run a number of times or otherwise 
should be optimized. This may be accomplished, for 
example, by placing host instructions in each translation 
which count the number of times a translation is executed 
and generate an exception (or branch) when a certain value 
is reached. The exception (or branch) transfers the operation 
to the code morphing software which then implements some 
or all of the following optimizations and any additional 
optimizations determined useful for that translation. A second 
method of determining translations being run a number 
of times and requiring optimization is to interrupt the 
execution of translations at some frequency or on some 
statistical basis and optimize any translation running at that 
time. This would ultimately provide that the instructions 
most often run would be optimized. Another solution would 
be to optimize each of certain particular types of host 
instructions such as those which create loops or are otherwise 
likely to be run most often. 
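
The first method, counting executions and trapping at a threshold, can be sketched in C. The struct `translation`, the function `count_execution`, and the notion of a per-translation threshold field are hypothetical; the specification says only that a certain value triggers the exception or branch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping kept with each stored translation. */
typedef struct {
    uint32_t run_count;  /* times this translation has executed */
    uint32_t threshold;  /* count at which optimization is requested */
    bool     optimized;  /* set once the optimizer has been invoked */
} translation;

/* Called on each entry to the translation.  Returns true exactly
 * once, on the execution that reaches the threshold; at that point
 * control would transfer to the code morphing software. */
bool count_execution(translation *t)
{
    if (t->optimized)
        return false;
    t->run_count++;
    if (t->run_count >= t->threshold) {
        t->optimized = true;
        return true;
    }
    return false;
}
```

The counter costs a few host instructions per execution, so a translation that runs only once pays almost nothing, while a hot loop reaches the threshold quickly.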

Optimization 

Assumes 32 bit flat address space which allows the elimination of segment 
base additions and some limit checks. 
Win32 uses Flat 32b segmentation 
Record Assumptions: 
    Rss_base==0 
    Rss_limit==0 
    Rds_base==0 
    Rds_limit==FFFFFFFF 
    SS and DS protection check 
    mov   %ecx, [%ebp+0xc]     // load c 
    add   R0,Rebp,0xc 
    chku  R0,R_FFFFFFFF 
    ld    Recx, [R0] 
    add   Reip,Reip,3 
    commit 
    mov   %eax, [%ebp+0x8]     // load s 
    add   R2,Rebp,0x8 
    chku  R2,R_FFFFFFFF 
    ld    Reax, [R2] 
    add   Reip,Reip,3 
    commit 
    mov   [%eax], %ecx         // store c into [s] 
    chku  Reax,R_FFFFFFFF 
    st    [Reax],Recx 
    add   Reip,Reip,2 
    commit 
    add   %eax, #4             // increment s by 4 
    addcc Reax,Reax,4 
    add   Reip,Reip,5 
    commit 
    mov   [%ebp+0x8], %eax     // store (s + 4) 
    add   R5,Rebp,0x8 
    chku  R5,R_FFFFFFFF 
    st    [R5],Reax 
    add   Reip,Reip,3 
    commit 
    mov   %eax, [%ebp+0x10]    // load n 
    add   R7,Rebp,0x10 
    chku  R7,R_FFFFFFFF 
    ld    Reax, [R7] 
    add   Reip,Reip,3 
    commit 
    lea   %ecx, [%eax-1]       // decrement n 
    sub   Recx,Reax,1 
    add   Reip,Reip,3 
    commit 
    mov   [%ebp+0x10], %ecx    // store (n - 1) 
    add   R9,Rebp,0x10 
    chku  R9,R_FFFFFFFF 
    st    [R9],Recx 
    add   Reip,Reip,3 
    commit 
    and   %eax, %eax           // test n 
    andcc R11,Reax,Reax 
    add   Reip,Reip,3 
    commit 
    jg    .-0x1b               // branch "n>0" 
    add   Rseq,Reip,Length(jg) 
    ldc   Rtarg,EIP(target) 
    selcc Reip,Rseq,Rtarg 
    commit 
    jg    mainloop,mainloop 



This sample illustrates a first stage of optimization which 
may be practiced utilizing the improved microprocessor. 
This stage of optimization, like many of the other operations 
of the code morphing software, assumes an optimistic result. 
The particular optimization assumes that a target application 
program which has begun as a 32 bit program written for a 
flat memory model provided by the X86 family of processors 
will continue as such a program. It will be noted that 
such an assumption is particular to the X86 family and 
would not necessarily be assumed with other families of 
processors being emulated. 

If this assumption is made, then in X86 applications all 
segments are mapped to the same address space. This allows 
those primitive host instructions required by the X86 segmentation 
process to be eliminated. As may be seen, the 
segment values are first set to zero. Then, the base for data 
is set to zero, and the limit set to the maximum available 
memory. Then, in each set of primitive host instructions for 
executing a target primitive instruction, the check for a 
segment base value and the computation of the segment base 
address required by segmentation are both eliminated. This 
reduces the loop to be executed by two host primitive 
instructions for each target primitive instruction requiring an 
addressing function. At this point, the host instruction check 
for the upper memory limit still exists. 

It should be noted that this optimization requires the 
speculation noted that the application utilizes a 32 bit flat 
memory model. If this is not true, then the error will be 
discovered as the main loop resolves the destination of 
control transfers and detects that the source assumptions do 
not match the destination assumptions. A new translation 
will then be necessary. This technique is very general and 
can be applied to a variety of segmentation and other 
"moded" cases where the "mode" changes infrequently, like 
debug, system management mode, or "real" mode. 
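
The comparison of recorded assumptions against current state can be sketched in C. The struct `segment_state`, the constant `flat_assumptions`, and the function `translation_usable` are hypothetical names; the four values are those listed under "Record Assumptions" in the sample, and how the main loop actually stores and compares them is not specified.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of the segment state a translation assumed. */
typedef struct {
    uint32_t ss_base, ss_limit;
    uint32_t ds_base, ds_limit;
} segment_state;

/* The assumptions recorded with the flat-model translation. */
const segment_state flat_assumptions = { 0, 0, 0, 0xFFFFFFFFu };

/* A translation may be entered only if the state it assumed
 * still matches the current target state. */
bool translation_usable(const segment_state *recorded,
                        const segment_state *current)
{
    return recorded->ss_base  == current->ss_base
        && recorded->ss_limit == current->ss_limit
        && recorded->ds_base  == current->ds_base
        && recorded->ds_limit == current->ds_limit;
}
```

When the check fails, the existing translation is simply not used and a new one is produced for the changed mode, so the speculation costs nothing in the common case where the mode never changes.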



Assume data addressed includes no bytes outside of computer memory 
limits which can only occur on unaligned page crossing memory 
references at the upper memory limit, and can be handled by 
special case software or hardware. 
    mov   %ecx, [%ebp+0xc]     // load c 
    add   R0,Rebp,0xc 
    ld    Recx, [R0] 
    add   Reip,Reip,3 
    commit 
    mov   %eax, [%ebp+0x8]     // load s 
    add   R2,Rebp,0x8 
    ld    Reax, [R2] 
    add   Reip,Reip,3 
    commit 
    mov   [%eax], %ecx         // store c into [s] 
    st    [Reax],Recx 
    add   Reip,Reip,2 
    commit 
    add   %eax, #4             // increment s by 4 
    addcc Reax,Reax,4 
    add   Reip,Reip,5 
    commit 
    mov   [%ebp+0x8], %eax     // store (s + 4) 
    add   R5,Rebp,0x8 
    st    [R5],Reax 
    add   Reip,Reip,3 
    commit 
    mov   %eax, [%ebp+0x10]    // load n 
    add   R7,Rebp,0x10 
    ld    Reax, [R7] 
    add   Reip,Reip,3 
    commit 
    lea   %ecx, [%eax-1]       // decrement n 
    sub   Recx,Reax,1 
    add   Reip,Reip,3 
    commit 
    mov   [%ebp+0x10], %ecx    // store (n - 1) 
    add   R9,Rebp,0x10 
    st    [R9],Recx 
    add   Reip,Reip,3 
    commit 
    and   %eax, %eax           // test n 
    andcc R11,Reax,Reax 
    add   Reip,Reip,3 
    commit 
    jg    .-0x1b               // branch "n>0" 
    add   Rseq,Reip,Length(jg) 
    ldc   Rtarg,EIP(target) 
    selcc Reip,Rseq,Rtarg 
    commit 
    jg    mainloop,mainloop 
Host Instruction key: 
    selcc = Select one of the source registers and copy its contents 
    to the destination register based on the condition codes. 



The above sample illustrates a next stage of optimization 
in which a speculative translation eliminates the upper 
memory boundary check which is only necessary for 
unaligned page crossing memory references at the top of the 
memory address space. Failure of this assumption is 
detected by either hardware or software alignment fix up. 
This reduces the translation by another host primitive 
instruction for each target primitive instruction requiring 
addressing. This optimization requires both the assumption 
noted before that the application utilizes a 32 bit flat memory 
model and the speculation that the instruction is aligned. If 
these are not both true, then the translation will fail when it 
is executed; and a new translation will be necessary. 



Detect and eliminate redundant address calculations. The example shows 
the code after eliminating the redundant operations. 
    mov   %ecx, [%ebp+0xc]     // load c 
    add   R0,Rebp,0xc 
    ld    Recx, [R0] 
    add   Reip,Reip,3 
    commit 
    mov   %eax, [%ebp+0x8]     // load s 
    add   R2,Rebp,0x8 
    ld    Reax, [R2] 
    add   Reip,Reip,3 
    commit 
    mov   [%eax], %ecx         // store c into [s] 
    st    [Reax],Recx 
    add   Reip,Reip,2 
    commit 
    add   %eax, #4             // increment s by 4 
    addcc Reax,Reax,4 
    add   Reip,Reip,5 
    commit 
    mov   [%ebp+0x8], %eax     // store (s + 4) 
    st    [R2],Reax 
    add   Reip,Reip,3 
    commit 
    mov   %eax, [%ebp+0x10]    // load n 
    add   R7,Rebp,0x10 
    ld    Reax, [R7] 
    add   Reip,Reip,3 
    commit 
    lea   %ecx, [%eax-1]       // decrement n 
    sub   Recx,Reax,1 
    add   Reip,Reip,3 
    commit 
    mov   [%ebp+0x10], %ecx    // store (n - 1) 
    st    [R7],Recx 
    add   Reip,Reip,3 
    commit 
    and   %eax, %eax           // test n 
    andcc R11,Reax,Reax 
    add   Reip,Reip,3 
    commit 
    jg    .-0x1b               // branch "n>0" 
    add   Rseq,Reip,Length(jg) 
    ldc   Rtarg,EIP(target) 
    selcc Reip,Rseq,Rtarg 
    commit 
    jg    mainloop,mainloop 





This sample illustrates a next optimization in which
common host expressions are eliminated. More particularly,
in translating the second target primitive instruction, a value
in working register Rebp (the working register representing
the stack base point register of an X86 processor) is added
to an offset value 0x8 and placed in a host working register
R2. It will be noted that the same operation took place in
translating target primitive instruction five in the previous
sample except that the result of the addition was placed in
working register R5. Consequently the value to be placed in
working register R5 already exists in working register R2
when host primitive instruction five is about to occur. Thus,
the host addition instruction may be eliminated from the
translation of target primitive instruction five; and the value
in working register R2 copied to working register R5.
Similarly, a host instruction adding a value in working
register Rebp to an offset value 0x10 may be eliminated in
the translation of target primitive instruction eight because
the step has already been accomplished in the translation of
target primitive instruction six and the result resides in
register R7. It should be noted that this optimization does not
depend on speculation and consequently is not subject to
failure and retranslation.
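
The elimination of common host expressions described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code morphing software itself: the tuple encoding (opcode, destination, operands read) and the function name eliminate_redundant_adds are assumptions made for the example, and the sketch assumes that an address temporary such as R2 is not overwritten while a later instruction still relies on it.

```python
def eliminate_redundant_adds(ops):
    """Drop an add whose operands were already added into a live register.

    ops is a list of (opcode, dest, reads) tuples; dest is None for a store.
    This encoding is hypothetical and exists only for the illustration.
    """
    available = {}  # (base, offset) -> register already holding base + offset
    rename = {}     # eliminated destination -> register holding the same value
    out = []
    for opcode, dest, reads in ops:
        # Later uses of an eliminated register read the surviving one instead.
        reads = tuple(rename.get(r, r) for r in reads)
        if opcode == "add" and reads in available:
            rename[dest] = available[reads]
            continue
        if dest is not None:
            # Overwriting dest invalidates sums held in it or computed from it.
            available = {k: v for k, v in available.items()
                         if v != dest and dest not in k}
            if opcode == "add" and dest not in reads:
                available[reads] = dest
        out.append((opcode, dest, reads))
    return out


# Fragment of the earlier sample: R5 recomputes the address already in R2.
fragment = [
    ("add", "R2", ("Rebp", "0x8")),
    ("ld", "Reax", ("R2",)),
    ("add", "R5", ("Rebp", "0x8")),
    ("st", None, ("R5", "Reax")),
]
optimized = eliminate_redundant_adds(fragment)
```

In this sketch the second add disappears and the store uses R2 directly, which corresponds to the `st [R2],Reax` line of the sample. Because nothing is assumed about run-time behavior, no rollback path is needed for this step.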



Assume that target exceptions will not occur within the translation so
delay updating eip and target state.



mov  %ecx,[%ebp+0xc]     // load c              add    R0,Rebp,0xc
                                                ld     Recx,[R0]
mov  %eax,[%ebp+0x8]     // load s              add    R2,Rebp,0x8
                                                ld     Reax,[R2]
mov  [%eax],%ecx         // store c into [s]    st     [Reax],Recx
add  %eax,#4             // increment s by 4    add    Reax,Reax,4
mov  [%ebp+0x8],%eax     // store (s + 4)       st     [R2],Reax
mov  %eax,[%ebp+0x10]    // load n              add    R7,Rebp,0x10
                                                ld     Reax,[R7]
lea  %ecx,[%eax-1]       // decrement n         sub    Recx,Reax,1
mov  [%ebp+0x10],%ecx    // store (n - 1)       st     [R7],Recx
and  %eax,%eax           // test n              andcc  R11,Reax,Reax
jg   .-0x1b              // branch "n>0"        add    Rseq,Reip,Length(block)
                                                ldc    Rtarg,EIP(target)
                                                selcc  Reip,Rseq,Rtarg
                                                commit
                                                jg     mainloop,mainloop





The above sample illustrates an optimization which
speculates that the translation of the primitive target
instructions making up the entire translation may be accomplished
without generating an exception. If this is true, then there is
no need to update the official target registers or to commit
the uncommitted stores in the store buffer at the end of each
sequence of host primitive instructions which carries out an
individual target primitive instruction. If the speculation
holds true, the official target registers need only be updated
and the stores need only be committed once, at the end of the
sequence of target primitive instructions. This allows the
elimination of two primitive host instructions for carrying
out each primitive target instruction. These are replaced by
a single host primitive instruction which updates the official
target registers and commits the uncommitted stores to
memory.

As will be understood, this is another speculative operation
which is also highly likely to involve a correct speculation.
This step offers a very great advantage over all prior
art emulation techniques if the speculation holds true. It
allows all of the primitive host instructions which carry out
the entire sequence of target primitive instructions to be
grouped in a sequence in which all of the individual host
primitives may be optimized together. This has the advantage
of allowing a great number of operations to be run in
parallel on a morph host which takes advantage of the very
long instruction word techniques. It also allows a greater
number of other optimizations to be made because more
choices for such optimizations exist. Once again, however,
if the speculation proves untrue and an exception is taken
when the loop is executed, the official target registers and
memory hold the official target state which existed at the
beginning of the sequence of target primitive instructions
since a commit does not occur until the sequence of host
instructions is actually executed. All that is necessary to
recover from an exception is to dump the uncommitted
stores, roll back the official registers into the working
registers, and restart translation of the target primitive
instructions at the beginning of the sequence. This
re-translation produces a translation of one target instruction
at a time, and the official state is updated after the host
sequence representing each target primitive instruction has
been translated. This translation is then executed. When the
exception occurs on this re-translation, correct target state is
immediately available in the official target registers and
memory for carrying out the exception.
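
The commit and rollback behavior described above can be modeled in software as follows. This is a minimal sketch under stated assumptions rather than a description of the actual hardware: the class name SpeculativeState, its method names, and the forwarding of loads from the uncommitted store buffer are all hypothetical choices made for the illustration.

```python
class SpeculativeState:
    """Toy model of working registers, official registers and a store buffer."""

    def __init__(self, registers, memory):
        self.official = dict(registers)  # last committed target state
        self.working = dict(registers)   # registers updated speculatively
        self.memory = dict(memory)       # committed memory
        self.store_buffer = []           # uncommitted (address, value) pairs

    def store(self, address, value):
        # Stores are held back until commit; memory is not touched yet.
        self.store_buffer.append((address, value))

    def load(self, address):
        # Assumption: a load sees the newest uncommitted store to the address.
        for buffered_address, value in reversed(self.store_buffer):
            if buffered_address == address:
                return value
        return self.memory[address]

    def commit(self):
        # One commit at the end of the sequence makes everything official.
        for address, value in self.store_buffer:
            self.memory[address] = value
        self.store_buffer.clear()
        self.official = dict(self.working)

    def rollback(self):
        # On an exception: dump the stores and restore the working registers.
        self.store_buffer.clear()
        self.working = dict(self.official)


state = SpeculativeState({"Reax": 1}, {0x10: 5})
state.working["Reax"] = 2
state.store(0x10, 9)
state.rollback()  # the speculation failed; nothing became official
```

After the rollback in this usage example, both the working register and memory hold the values that existed at the last commit, which is the state from which a one-instruction-at-a-time re-translation can proceed.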






In summary: 



add    R0,Rebp,0xc
ld     Recx,[R0]
add    R2,Rebp,0x8
ld     Reax,[R2]
st     [Reax],Recx
add    Reax,Reax,4
st     [R2],Reax
add    R7,Rebp,0x10
ld     Reax,[R7]
sub    Recx,Reax,1
st     [R7],Recx
andcc  R11,Reax,Reax
add    Rseq,Reip,Length(block)
ldc    Rtarg,EIP(target)
selcc  Reip,Rseq,Rtarg
commit
jg     mainloop,mainloop



a different unused working register to eliminate the possibility
that two host instructions will require the same hardware.
Thus, for example, the second host primitive instruction
in two samples above uses working register Recx which
represents an official target register ECX. The tenth host
primitive instruction also uses the working register Recx. By
changing the operation in the second host primitive instruction
so that the value pointed to by the address in R0 is
stored in the working register R1 rather than the register
Recx, the two host instructions do not both use the same
register. Similarly, the fourth, fifth, and sixth host primitive
instructions all utilize the working register Reax in the
earlier sample; by changing the fourth host primitive instruction
to utilize the previously unused working register R3
instead of the working register Reax and the sixth host primitive
instruction to utilize the previously unused working
register R4 instead of the register Reax, these hardware
dependencies are eliminated.
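
The renaming just described can be sketched as follows. This is an illustrative sketch only: the tuple encoding (opcode, destination, operands read), the function name rename_registers, and the policy of renaming every write to a target register except the last one are assumptions made for the example, not the method of the code morphing software.

```python
def rename_registers(ops, live_out, fresh):
    """Give each non-final write to a live-out register its own spare register.

    ops is a list of (opcode, dest, reads) tuples (hypothetical encoding).
    live_out names target registers whose final value must be correct at commit.
    fresh lists spare working registers, used in order.
    """
    last_write = {dest: i for i, (_, dest, _) in enumerate(ops)
                  if dest is not None}
    spare = iter(fresh)
    current = {}  # target register -> spare register holding its latest value
    renamed = []
    for i, (opcode, dest, reads) in enumerate(ops):
        # Operands are read before the destination is written.
        reads = tuple(current.get(r, r) for r in reads)
        if dest in live_out:
            if last_write[dest] == i:
                current.pop(dest, None)  # final value lands in the real register
            else:
                current[dest] = next(spare)
                dest = current[dest]
        renamed.append((opcode, dest, reads))
    return renamed


summary_ops = [
    ("add", "R0", ("Rebp", "0xc")),
    ("ld", "Recx", ("R0",)),
    ("add", "R2", ("Rebp", "0x8")),
    ("ld", "Reax", ("R2",)),
    ("st", None, ("Reax", "Recx")),
    ("add", "Reax", ("Reax", "4")),
    ("st", None, ("R2", "Reax")),
    ("add", "R7", ("Rebp", "0x10")),
    ("ld", "Reax", ("R7",)),
    ("sub", "Recx", ("Reax", "1")),
    ("st", None, ("R7", "Recx")),
]
renamed = rename_registers(summary_ops, {"Reax", "Recx"}, ["R1", "R3", "R4"])
```

Applied to the first eleven host primitives of the summary, the sketch moves the early loads into R1 and R3 and the incremented pointer into R4, leaving only the final writes in Reax and Recx.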



After the scheduling process, which organizes the primitive host
operations as multiple operations that can execute in parallel on
the host VLIW hardware. Each line shows the parallel operations that
the VLIW machine executes, and the "&" indicates the parallelism.



// Live out 
// Live out 



The comment "Live Out" refers to the need to actually maintain Reax and 
Recx correctly prior to the commit. Otherwise further optimization might be 
possible. 

The summary above illustrates the sequence of host 
primitive instructions which remain at this point in the 
optimization process. While this example shows the main- 
tenance of the target instruction pointer (EIP) inline, it is 
possible to maintain the pointer EIP for branches out of line 
at translation time, which would remove the pointer EIP 
updating sequence from this and subsequent steps of the 
example. 



Renaming to reduce register resource dependencies. This will allow 
subsequent scheduling to be more effective. From this point on, the 
original target X86 code is omitted as the relationship between individual 
target X86 instructions and host instructions becomes increasingly blurred. 





add    R2,Rebp,0x8       & add    R0,Rebp,0xc
nop                      & add    R7,Rebp,0x10
ld     R3,[R2]           & add    Rseq,Reip,Length(block)
ld     R1,[R0]           & add    R4,R3,4
st     [R3],R1           & ldc    Rtarg,EIP(target)
ld     Reax,[R7]         & nop
st     [R2],R4           & sub    Recx,Reax,1
st     [R7],Recx         & andcc  R11,Reax,Reax
selcc  Reip,Rseq,Rtarg   & jg     mainloop,mainloop   & commit



add    R0,Rebp,0xc
ld     R1,[R0]
add    R2,Rebp,0x8
ld     R3,[R2]
st     [R3],R1
add    R4,R3,4
st     [R2],R4
add    R7,Rebp,0x10
ld     Reax,[R7]                 // Live out
sub    Recx,Reax,1               // Live out
st     [R7],Recx
andcc  R11,Reax,Reax
add    Rseq,Reip,Length(block)
ldc    Rtarg,EIP(target)
selcc  Reip,Rseq,Rtarg
commit
jg     mainloop,mainloop





Host Instruction key: 
nop - no operation 

The above sample illustrates the scheduling of host primitive
instructions for execution on the morph host. In this
example, the morph host is presumed to be a VLIW processor
which in addition to the hardware enhancements
provided for cooperating with the code morphing software
also includes, among other processing units, two arithmetic
and logic (ALU) units. The first line illustrates two individual
add instructions which have been scheduled to run
together on the morph host. As may be seen, these are the
third and the eighth primitive host instructions in the sample
just before the summary above. The second line includes a
NOP instruction (no operation but go to next instruction) and
another add instruction. The NOP instruction illustrates that
there are not always two instructions which can be run
together even after some scheduling optimizing has taken
place. In any case, this sample illustrates that only nine sets
of primitive host instructions are left at this point to execute
the original ten target instructions.
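
The packing of independent host primitives into parallel sets can be illustrated with a simple greedy scheduler. This is a sketch under several assumptions and is not the scheduler of the code morphing software: the tuple encoding and the function name schedule are hypothetical, every slot is treated as able to run any operation, memory operations are ordered conservatively whenever a store is involved, and empty slots are simply left unfilled instead of being padded with nop.

```python
MEMORY_OPS = ("ld", "st")


def schedule(ops, width=2):
    """Greedily place each op in the earliest bundle its dependencies allow.

    ops is a list of (opcode, dest, reads) tuples (hypothetical encoding).
    Returns a list of bundles, each holding at most `width` ops.
    """
    bundles = []
    placed = []  # bundle index chosen for each op, in program order
    for j, (opcode_j, dest_j, reads_j) in enumerate(ops):
        earliest = 0
        for i, (opcode_i, dest_i, reads_i) in enumerate(ops[:j]):
            reads_result = dest_i is not None and dest_i in reads_j
            overwrites = dest_j is not None and (
                dest_j in reads_i or dest_j == dest_i)
            memory_order = (opcode_i in MEMORY_OPS and opcode_j in MEMORY_OPS
                            and "st" in (opcode_i, opcode_j))
            if reads_result or overwrites or memory_order:
                earliest = max(earliest, placed[i] + 1)
        slot = earliest
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(ops[j])
        placed.append(slot)
    return bundles


address_and_load = [
    ("add", "R0", ("Rebp", "0xc")),
    ("ld", "R1", ("R0",)),
    ("add", "R2", ("Rebp", "0x8")),
    ("ld", "R3", ("R2",)),
]
bundles = schedule(address_and_load)
```

For the four renamed primitives in the usage example, the two address additions share the first bundle and the two loads share the second, much as the first line of the scheduled sample pairs two add instructions. A production scheduler would also account for the number and kind of functional units, which this sketch ignores.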






Resolve host branch targets and chain stored translations 



This sample illustrates a next step of optimization, normally
called register renaming, in which operations requiring
working registers used for more than one operation in the
sequence of host primitive instructions are changed to utilize



add    R2,Rebp,0x8       & add    R0,Rebp,0xc
nop                      & add    R7,Rebp,0x10
ld     R3,[R2]           & add    Rseq,Reip,Length(block)
ld     R1,[R0]           & add    R4,R3,4
st     [R3],R1           & ldc    Rtarg,EIP(target)
ld     Reax,[R7]         & nop
st     [R2],R4           & sub    Recx,Reax,1
st     [R7],Recx         & andcc  R11,Reax,Reax
selcc  Reip,Rseq,Rtarg   & jg     Sequential,Target   & commit



This sample illustrates essentially the same set of host
primitive instructions except that the instructions have by
now been stored in the translation buffer and executed one
or more times because the last jump (jg) instruction now
points to a jump address furnished by chaining to another
sequence of translated instructions. The chaining process
takes the sequence of instructions out of the translator main
loop so that translation of the sequence has been completed.



Advanced Optimizations, Backward Code Motion:
This and subsequent examples start with the code prior to scheduling.
This optimization first depends on detecting that the code is a loop.
Then invariant operations can be moved out of the loop body and executed
once before entering the loop body.



Entry:
        add    R0,Rebp,0xc
        add    R2,Rebp,0x8
        add    R7,Rebp,0x10
        add    Rseq,Reip,Length(block)
        ldc    Rtarg,EIP(target)
Loop:
        ld     R1,[R0]
        ld     R3,[R2]
        st     [R3],R1
        add    R4,R3,4
        st     [R2],R4
        ld     Reax,[R7]
        sub    Recx,Reax,1
        st     [R7],Recx
        andcc  R11,Reax,Reax
        selcc  Reip,Rseq,Rtarg
        commit
        jg     mainloop,Loop



The above sample illustrates an advanced optimization
step which is usually only utilized with sequences which are
to be repeated a large number of times. The process first
detects translations that form loops, and reviews the individual
primitive host instructions to determine which
instructions produce constant results within the loop body.
These instructions are removed from the loop and executed
only once to place a value in a register; from that point on,
the value stored in the register is used rather than rerunning
the instruction.
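
The detection of instructions that produce constant results within the loop body can be sketched as below. This is a deliberately conservative illustration, not the actual analysis: the tuple encoding and the function name hoist_invariants are hypothetical, only one pass is made, and an operation is hoisted only if its destination is written once in the loop and none of its operands is written anywhere in the loop. Under that rule the sketch would not hoist the Rseq computation shown in the sample, because Reip is written by selcc inside the loop.

```python
def hoist_invariants(loop_ops, pure=("add", "sub", "ldc")):
    """Split a loop body into hoisted (loop-invariant) ops and remaining ops.

    loop_ops is a list of (opcode, dest, reads) tuples (hypothetical encoding).
    Only side-effect-free opcodes listed in `pure` are candidates.
    """
    written = [dest for _, dest, _ in loop_ops if dest is not None]
    hoisted, body = [], []
    for op in loop_ops:
        opcode, dest, reads = op
        invariant = (
            opcode in pure
            and written.count(dest) == 1               # single definition
            and not any(r in written for r in reads)   # operands never change
        )
        (hoisted if invariant else body).append(op)
    return hoisted, body


loop = [
    ("add", "R0", ("Rebp", "0xc")),
    ("ld", "R1", ("R0",)),
    ("add", "R2", ("Rebp", "0x8")),
    ("ld", "R3", ("R2",)),
    ("st", None, ("R3", "R1")),
    ("add", "R4", ("R3", "4")),
    ("st", None, ("R2", "R4")),
]
hoisted, body = hoist_invariants(loop)
```

In the usage example the two address additions based on Rebp are moved ahead of the loop, while `add R4,R3,4` stays in the body because R3 is reloaded on every iteration.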



Schedule the loop body after backward code motion. For example 
purposes, only the code in the loop body is shown scheduled 



Entry:
        add    R0,Rebp,0xc
        add    R2,Rebp,0x8
        add    R7,Rebp,0x10
        add    Rseq,Reip,Length(block)
        ldc    Rtarg,EIP(target)
Loop:
        ld     R3,[R2]           & nop
        ld     R1,[R0]           & add    R4,R3,4
        st     [R3],R1           & nop
        ld     Reax,[R7]         & nop
        st     [R2],R4           & sub    Recx,Reax,1
        st     [R7],Recx         & andcc  R11,Reax,Reax
        selcc  Reip,Rseq,Rtarg   & jg     Sequential,Loop   & commit



Host Instruction key: 

ldc = load a 32-bit constant

When these non-repetitive instructions are removed from
the loop and the sequence is scheduled for execution, the
scheduled instructions appear as in the last sample above. It
can be seen that the initial instructions are performed only
once during the first iteration of the loop and thereafter only
the host primitive instructions remaining in the seven clock
intervals shown are executed during the loop. Thus, the
execution time has been reduced to seven instruction intervals
from the ten instructions necessary to execute the
primitive target instructions.

As may be seen, the steps which have been removed from
the loop are address generation steps. Thus, address generation
need only be done once per loop invocation in the
improved microprocessor. On the other hand, the address
generation hardware of the X86 target processor must generate
these addresses each time the loop is executed. If a loop
is executed one hundred times, the improved microprocessor
generates the addresses only once while a target processor
would generate each address one hundred times.

After Backward Code Motion: 

Target:
        add    R0,Rebp,0xc
        add    R2,Rebp,0x8
        add    R7,Rebp,0x10
        add    Rseq,Reip,Length(block)
        ldc    Rtarg,EIP(target)
Loop:
        ld     R1,[R0]
        ld     R3,[R2]
        st     [R3],R1
        add    R4,R3,4
        st     [R2],R4
        ld     Reax,[R7]                 //Live out
        sub    Recx,Reax,1               //Live out
        st     [R7],Recx
        andcc  R11,Reax,Reax
        selcc  Reip,Rseq,Rtarg
        commit
        jg     mainloop,Loop



Register Allocation: 

This shows the use of register alias detection hardware of the morph 
host that allows variables to be safely moved from memory into 
registers. The starting point is the code after "backward code motion". 
This shows the optimization that can eliminate loads. 
First the loads are performed. The address is protected by the alias 
hardware, such that should a store to the address occur, an "alias" 
exception is raised. The loads in the loop body are then replaced with 
copies. After the main body of the loop, the alias hardware is freed. 
Entry:
        add    R0,Rebp,0xc
        add    R2,Rebp,0x8
        add    R7,Rebp,0x10
        add    Rseq,Reip,Length(block)
        ldc    Rtarg,EIP(target)
        ld     Rc,[R0]                   ;First do the load of the
                                          variable from memory
        prot   [R0],Alias1               ;Then protect the memory
                                          location from stores
        ld     Rs,[R2]
        prot   [R2],Alias2
        ld     Rn,[R7]
        prot   [R7],Alias3
Loop:
        copy   R1,Rc
        copy   R3,Rs
        st     [R3],R1
        add    R4,Rs,4
        copy   Rs,R4
        st     [R2],Rs,NoAliasCheck
        copy   Reax,Rn                   //Live out
        sub    Recx,Reax,1               //Live out
        copy   Rn,Recx
        st     [R7],Rn,NoAliasCheck
        andcc  R11,Reax,Reax
        selcc  Reip,Rseq,Rtarg
        commit
        jg     Epilog,Loop
Epilog:
        FA     Alias1                    Free the alias detection hardware
        FA     Alias2                    Free the alias detection hardware
        FA     Alias3                    Free the alias detection hardware
        j      Sequential



Host Instruction key: 

protect - protect address from loads 

FA - free alias 

copy - copy 

j - jump 
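
The behavior of the prot, FA, and NoAliasCheck operations used in the sample can be modeled in software roughly as follows. This is only a sketch under stated assumptions, not a description of the alias detection hardware: the class and method names are hypothetical, addresses are compared for exact equality only, and only stores are checked.

```python
class AliasException(Exception):
    """Raised when a checked store hits a protected address."""


class AliasUnit:
    """Toy model of address protection (hypothetical names and granularity)."""

    def __init__(self):
        self.protected = {}  # alias name -> protected memory address

    def protect(self, name, address):
        # Models "prot [Rx],AliasN": remember the address held in a register.
        self.protected[name] = address

    def free(self, name):
        # Models "FA AliasN": release the protection.
        self.protected.pop(name, None)

    def store(self, memory, address, value, no_alias_check=False):
        # A store compares its address against every protected address
        # unless the translation marked it NoAliasCheck.
        if not no_alias_check and address in self.protected.values():
            raise AliasException(hex(address))
        memory[address] = value


unit = AliasUnit()
memory = {}
unit.protect("Alias2", 0x108)         # stack slot holding s
unit.store(memory, 0x200, 7)          # ordinary store elsewhere: allowed
try:
    unit.store(memory, 0x108, 1)      # unexpected store to the slot
    alias_raised = False
except AliasException:
    alias_raised = True
unit.store(memory, 0x108, 0x204, no_alias_check=True)  # translator's own store
unit.free("Alias2")
```

In the usage example the unexpected store raises the exception, while the store that the translation itself issues to keep memory up to date passes because it is marked as exempt from the check.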



Copy Propagation: 
After using the alias hardware to turn loads within the loop body into 
copies, copy propagation allows the elimination of some copies. 



Entry:
        add    R0,Rebp,0xc
        add    R2,Rebp,0x8
        add    R7,Rebp,0x10
        add    Rseq,Reip,Length(block)
        ldc    Rtarg,EIP(target)
        ld     Rc,[R0]
        prot   [R0],Alias1
        ld     Rs,[R2]
        prot   [R2],Alias2
        ld     Recx,[R7]
        prot   [R7],Alias3
Loop:
        st     [Rs],Rc
        add    Rs,Rs,4
        st     [R2],Rs,NoAliasCheck
        copy   Reax,Recx                 //Live out
        sub    Recx,Reax,1               //Live out
        st     [R7],Recx,NoAliasCheck
        andcc  R11,Reax,Reax
        selcc  Reip,Rseq,Rtarg
        commit
        jg     Epilog,Loop
Epilog:
        FA     Alias1
        FA     Alias2
        FA     Alias3
        j      Sequential



The register allocation sample illustrates an even more advanced
optimization which may be practiced by the microprocessor including
the present invention. Referring back to the second
sample before that sample, it will be noticed that the first
three add instructions involved computing addresses on the
stack. These addresses do not change during the execution of
the sequence of host operations. Consequently, the values
stored at these addresses may be retrieved from memory and
loaded in registers where they are immediately available for
execution. As may be seen, this is done in host primitive
instructions six, eight, and ten. In instructions seven, nine,
and eleven, each of the memory addresses is marked as
protected by special host alias hardware and the registers are
indicated as aliases for those memory addresses so that any
attempt to vary the data will cause an exception. At this
point, each of the load operations involving moving data
from these stack memory addresses becomes a simple
register-to-register copy operation which proceeds much
faster than loading from a memory address. It should be
noted that once the loop has been executed until n=0, the
protection must be removed from each of the memory
addresses so that the alias registers may be otherwise
utilized.

The copy propagation sample illustrates the next stage of
optimization in which it is recognized that most of the copy
instructions which replaced the load instructions in the
register allocation sample are unnecessary and may be
eliminated. That is, if a register-to-register copy operation
takes place, then the data existed before the operation in the
register from which the data was copied. If so, the data can
be accessed in the first register rather than the register to
which it is being copied and the copy operation eliminated.
As may be seen, this eliminates the first, second, fifth, and
ninth primitive host instructions shown in the loop of the
register allocation sample. In addition, the registers used in
others of the host primitive instructions are also changed to
reflect the correct registers for the data. Thus, for example,
when the first and second copy instructions are eliminated,
the third store instruction must copy the data from the
working register Rc where it exists (rather than register R1)
and place the data at the address indicated in working
register Rs where the address exists (rather than register R3).
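
Forward copy propagation of the kind described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the tuple encoding and the names propagate_copies and keep are hypothetical, and the sketch only removes a copy when every later use of the copied-to register occurs before the source register is overwritten. It does not perform the further step seen in the sample of folding `add R4,Rs,4` and `copy Rs,R4` into `add Rs,Rs,4`.

```python
def _dead_before_clobber(ops, index, dest, src):
    """True if dest is never read after src is next overwritten."""
    clobbered = False
    for _, later_dest, later_reads in ops[index + 1:]:
        if clobbered and dest in later_reads:
            return False
        if later_dest == dest:
            return True   # dest is redefined; the copied value is dead
        if later_dest == src:
            clobbered = True
    return True


def propagate_copies(ops, keep=()):
    """Remove copies by making later readers use the source register.

    ops is a list of (opcode, dest, reads) tuples (hypothetical encoding).
    keep names registers whose copies must stay, e.g. loop-carried values.
    """
    substitute = {}  # eliminated register -> register that holds its value
    out = []
    for i, (opcode, dest, reads) in enumerate(ops):
        reads = tuple(substitute.get(r, r) for r in reads)
        if dest is not None:
            # A new write ends any substitution involving this register.
            substitute = {d: s for d, s in substitute.items()
                          if dest not in (d, s)}
        if (opcode == "copy" and dest not in keep
                and _dead_before_clobber(ops, i, dest, reads[0])):
            substitute[dest] = reads[0]
            continue
        out.append((opcode, dest, reads))
    return out


loop_fragment = [
    ("copy", "R1", ("Rc",)),
    ("copy", "R3", ("Rs",)),
    ("st", None, ("R3", "R1")),
    ("add", "R4", ("Rs", "4")),
    ("copy", "Rs", ("R4",)),
    ("st", None, ("R2", "Rs")),
]
propagated = propagate_copies(loop_fragment, keep={"Rs"})
```

In the usage example the first two copies disappear and the store becomes `st [Rs],Rc`, matching the first loop instruction of the copy propagation sample.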

Example illustrating scheduling of the loop body only. 

Entry:
        add    R0,Rebp,0xc
        add    R2,Rebp,0x8
        add    R7,Rebp,0x10
        add    Rseq,Reip,Length(block)
        ldc    Rtarg,EIP(target)
        ld     Rc,[R0]
        prot   [R0],Alias1
        ld     Rs,[R2]
        prot   [R2],Alias2
        ld     Recx,[R7]
        prot   [R7],Alias3
Loop:
        st     [Rs],Rc           & add    Rs,Rs,4         & copy  Reax,Recx
        st     [R2],Rs,NAC       & sub    Recx,Reax,1
        st     [R7],Recx,NAC     & andcc  R11,Reax,Reax
        selcc  Reip,Rseq,Rtarg   & jg     Epilog,Loop     & commit
Epilog:
        FA     Alias1
        FA     Alias2
        FA     Alias3
        j      Sequential





Host Instruction key: 
NAC = No Alias Check

The scheduled host instructions are illustrated in the
sample above. It will be noted that the sequence is such that
fewer clocks are required to execute the loop than to execute
the primitive target instructions originally decoded from the
source code. Thus, apart from all of the other acceleration
accomplished, the total number of combined operations to
be run is simply less than the operations necessary to execute
the original target code.

Store Elimination by use of the alias hardware.

Entry:
        add    R0,Rebp,0xc
        add    R2,Rebp,0x8
        add    R7,Rebp,0x10
        add    Rseq,Reip,Length(block)
        ldc    Rtarg,EIP(target)
        ld     Rc,[R0]
        prot   [R0],Alias1               ;protect the address from
                                          loads and stores
        ld     Rs,[R2]
        prot   [R2],Alias2               ;protect the address from
                                          loads and stores
        ld     Recx,[R7]
        prot   [R7],Alias3               ;protect the address from
                                          loads and stores
Loop:
        st     [Rs],Rc           & add    Rs,Rs,4         & copy  Reax,Recx
        sub    Recx,Reax,1       & andcc  R11,Reax,Reax
        selcc  Reip,Rseq,Rtarg   & jg     Epilog,Loop     & commit
Epilog:
        FA     Alias1
        FA     Alias2
        FA     Alias3
        st     [R2],Rs                   ;writeback the final value
                                          of Rs
        st     [R7],Recx                 ;writeback the final value
                                          of Recx
        j      Sequential







The final optimization shown in this sample is the use of 
the alias hardware to eliminate stores. This eliminates the 
stores from within the loop body, and performs them only in 
the loop epilog. This reduces the number of host instructions 
within the loop body to three compared to the original ten 
target instructions. 
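
As a check on the reasoning, the original target loop and the final optimized loop can be compared with a small simulation. This is a sketch under stated assumptions: the function names, the dictionary used as memory, and the concrete addresses are hypothetical; 32-bit wraparound is ignored; and the buffer written through s is assumed not to overlap the stack slots, which is the condition the alias hardware guards.

```python
def run_target_loop(memory, ebp):
    """Execute the ten target instructions as written, until n <= 0."""
    while True:
        ecx = memory[ebp + 0xC]          # load c
        eax = memory[ebp + 0x8]          # load s
        memory[eax] = ecx                # store c into [s]
        eax += 4                         # increment s by 4
        memory[ebp + 0x8] = eax          # store (s + 4)
        eax = memory[ebp + 0x10]         # load n
        ecx = eax - 1                    # decrement n
        memory[ebp + 0x10] = ecx         # store (n - 1)
        if not eax > 0:                  # jg tests the value of n before
            break                        # the decrement
    return memory


def run_optimized_loop(memory, ebp):
    """Execute the store-eliminated translation: values live in registers."""
    rc = memory[ebp + 0xC]
    rs = memory[ebp + 0x8]
    recx = memory[ebp + 0x10]
    while True:
        memory[rs] = rc                  # st [Rs],Rc
        rs += 4                          # add Rs,Rs,4
        reax = recx                      # copy Reax,Recx
        recx = reax - 1                  # sub Recx,Reax,1
        if not reax > 0:
            break
    memory[ebp + 0x8] = rs               # epilog: write back final Rs
    memory[ebp + 0x10] = recx            # epilog: write back final Recx
    return memory


EBP = 0x100
initial = {EBP + 0xC: 7, EBP + 0x8: 0x200, EBP + 0x10: 3}
target_result = run_target_loop(dict(initial), EBP)
optimized_result = run_optimized_loop(dict(initial), EBP)
```

Both functions leave memory in the same final state for this input, while the optimized version touches the stack slots only before the loop and in the epilog.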

Although the present invention has been described in 
terms of a preferred embodiment, it will be appreciated that 
various modifications and alterations might be made by 
those skilled in the art without departing from the spirit and 
scope of the invention. For example, although the invention 
has been described with relation to the emulation of X86 
processors, it should be understood that the invention 
applies just as well to programs designed for other processor 
architectures, and programs that execute on virtual 
machines, such as P code, Postscript, or Java programs. The 
invention should therefore be measured in terms of the 
claims which follow. 

What is claimed is: 

1. A memory controller for use with a microprocessor 
including an execution unit having a plurality of registers, 
the memory controller comprising: 

means for storing memory data to be frequently accessed
during a code sequence by the execution unit in a first
register of the execution unit,

means for holding the memory address of the data in the
first register of the execution unit in a second register
of the execution unit during the execution of the code
sequence by the execution unit,

means for detecting an access attempted to the memory
address during the execution of the code sequence, and

means for maintaining the data in the first register and in
memory consistent and valid during execution of the
code sequence.

2. A memory controller as claimed in claim 1 in which the 
means for detecting an access attempted to the memory 
address during the execution of the code sequence comprises 
a comparator for comparing the access address with the 
memory address in the second register and generating an 
exception in response to a comparison. 

3. A memory controller as claimed in claim 2 in which the
means for maintaining the data in the first register and in
memory consistent and valid during execution of the code
sequence comprises software implemented means responsive
to an exception for replacing stale data with valid data
being written.

4. A memory controller as claimed in claim 2 in which the 
comparator comprises means for generating an exception to 
an attempt to write the memory address when the data in the 
first register is being utilized instead of data at the memory 
address during execution of the code sequence; and 

in which the means for maintaining the data in the first
register and in memory consistent and valid during
execution of the code sequence comprises means for
updating the data in the first register with data written
to the memory address.

5. A memory controller as claimed in claim 2 in which the 
comparator comprises means for generating an exception to 
an attempt to read the memory address when data is being 
loaded to the first register in place of the memory address 
during execution of the code sequence; and 

in which the means for maintaining the data in the first
register and in memory consistent and valid during
execution of the code sequence comprises means for
updating the data at the memory address with data in
the first register.

6. A memory controller for use with a microprocessor 
including an execution unit having a plurality of registers, 
the memory controller comprising: 

means for storing memory data to be frequently accessed 
during a code sequence by the execution unit in a first 
register of the execution unit, 

means for holding the memory address of the data in the 
first register of the execution unit in a second register 
of the execution unit during the execution of the code 
sequence by the execution unit, 

means for detecting an access attempted to the memory 
address during the execution of the code sequence 
comprising a comparator for comparing the access 
address with the memory address in the second register 
and generating an exception in response to a 
comparison, and 

means for maintaining the data in the first register and in
memory consistent and valid during execution of the
code sequence comprising software implemented means
responsive to an exception for retranslating into a new
code sequence without storing memory data in the first
register which is frequently utilized by the execution
unit during a code sequence and executing the new code
sequence.

7. A computer system comprising: 

a host processor designed to execute instructions of a host
instruction set, the host processor including an execution
unit having a plurality of registers;

software for translating instructions from a target instruc- 
tion set to instructions of the host instruction set; 

memory for storing target instructions from a program 
being translated, and 

a memory controller for storing memory data which is 
frequently utilized by the host processor during a code 
sequence in a first register of the execution unit, 

a second register for holding a memory address of 
memory data stored in the first register, and 

means for selecting data frequently utilized by the host
processor during a code sequence to be stored in the
second register.

8. A computer system as claimed in claim 7 which further
comprises means for assuring that data stored in the first
register and at the memory address remain consistent.

9. A computer system as claimed in claim 8 in which the 
means for assuring that data stored in the first register and at 
the memory address remain consistent comprises a com- 
parator for comparing addresses of memory accesses with a 
memory address in the second register and generating an 
exception when addresses compare. 

10. A computer as claimed in claim 9 in which the means
for assuring that data stored in the first register and at the
memory address remain consistent further comprises software
implemented means responsive to an exception generated
by the comparator for replacing stale data with valid
data being written.

11. A computer system as claimed in claim 9 further
comprising means responding to an exception taken during
a write access of a memory address for updating data stored
in the first register with data being written to the memory
address.

12. A computer system as claimed in claim 9 further
comprising means responding to an exception taken during
a read access of a memory address for updating data stored
at the memory address with data stored in the first register.

13. A computer comprising: 

a host processor designed to execute instructions of a host
instruction set, the host processor including an execution
unit having a plurality of registers;

software for translating instructions from a target instruc- 
tion set to instructions of the host instruction set; 

memory for storing target instructions from a program 
being translated; 

a memory controller for storing memory data which is 
frequently utilized by the host processor during a code 
sequence in a first register of the execution unit; 

a second register for holding a memory address of
memory data stored in the first register;

means for selecting data frequently utilized by the host
processor during a code sequence to be stored in the
second register; and

means for assuring that data stored in the first register and
at the memory address remain consistent comprising:
a comparator for comparing addresses of memory
accesses with a memory address in the second register
and generating an exception when addresses
compare, and

software implemented means responsive to an exception
generated by the comparator for retranslating
into a new code sequence without storing memory
data in the first register which is frequently utilized
by the host processor during a code sequence and
executing the new code sequence.

14. A method for enhancing the speed of a processor 
comprising the steps of: 

placing memory data to be frequently accessed during a 
code sequence of the execution unit in a first register of 
the execution unit, 

storing a memory address of the data in the first register 
of the execution unit in a second register of the execu- 
tion unit, 

detecting an access attempted to the memory address 
during the execution of the code sequence, and 

maintaining the data in the first register and at the memory
address consistent and valid during execution of the
code sequence.

15. A method as claimed in claim 14 in which the step of
detecting an access attempted to the memory address during
the execution of the code sequence comprises comparing an
access address with the memory address in the second
register, and

generating an exception in response to a comparison.

16. A method as claimed in claim 15 in which the step of
maintaining the data in the first register and at the memory
address consistent and valid during execution of the code
sequence further comprises responding to an exception
generated by a comparison by replacing stale data with valid
data being written.

17. A method as claimed in claim 15

in which the step of generating an exception in response
to a comparison comprises generating an exception to
an attempt to write the memory address when the data
in the first register is being copied to another register
during execution of the code sequence; and

in which the step of maintaining the data in the first
register and in memory consistent and valid during
execution of the code sequence comprises updating the
data in the first register with data to be written to the
memory address.

18. A method as claimed in claim 15 

in which the step of generating an exception in response 
to a comparison comprises generating an exception to 
an attempt to read the memory address when data is 
being copied to the first register during execution of the 
code sequence; and 

in which the step of maintaining the data in the first 
register and in memory consistent and valid during 
execution of the code sequence comprises updating the 
data at the memory address with data in the first 
register. 
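
The two repair actions of claims 17 and 18 can be sketched as a single handler. This is an illustrative model, not the patented implementation: `handle_alias_exception` and the `regs` dictionary are hypothetical names, and the sketch assumes that the trapped write is also allowed to complete to memory once the register has been refreshed.

```python
def handle_alias_exception(memory, regs, is_write, value=None):
    """Repair an access that hit the aliased address.

    memory: dict mapping address -> value
    regs:   dict with 'data' (first register) and 'addr' (second register)
    """
    address = regs["addr"]
    if is_write:
        # Claim 17 case: the register copy would go stale, so refresh it
        # with the data being written.
        regs["data"] = value
        # Assumption: the write itself then completes to memory.
        memory[address] = value
        return None
    # Claim 18 case: the register may hold newer data than memory, so
    # update memory from the register before the read is satisfied.
    memory[address] = regs["data"]
    return memory[address]
```

Either way, the register and the memory location hold the same value when the handler returns, which is the consistency condition of claim 14.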

19. A method for enhancing the speed of a processor 
comprising the steps of: 

placing memory data to be frequently accessed during a 
code sequence of the execution unit in a first register of
the execution unit, 

storing a memory address of the data in the first register 
of the execution unit in a second register of the execu- 
tion unit, 

detecting an access attempted to the memory address 
during the execution of the code sequence comprising:
comparing an access address with the memory address
in the second register, and
generating an exception in response to a comparison,
and

maintaining the data in the first register and at the memory 
address consistent and valid during execution of the 
code sequence comprising 

responding to an exception generated by a comparison by 
retranslating into a new code sequence without storing 
memory data in the first register which is frequently 
utilized by the execution unit during a code sequence 
and executing the new code sequence. 
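
The retranslation response of claim 19 can be pictured as a fallback from an optimized sequence to a conservative one. The following is a minimal sketch with hypothetical names (`run_aliased`, `run_unaliased`, `execute`). It assumes that state is restored to its value at the start of the sequence before the new sequence runs, which the claim itself does not recite, and it uses a plain dictionary copy as a stand-in for that restoration.

```python
class AliasException(Exception):
    """Signals that a store hit the address held in the second register."""


def run_aliased(memory, hot_addr, stores):
    """Optimized sequence: hot_addr is kept in a register for the whole run."""
    reg = memory[hot_addr]                # first register
    total = 0
    for addr, value in stores:
        if addr == hot_addr:              # comparison against second register
            raise AliasException(hex(addr))
        memory[addr] = value
        total += reg                      # use the register copy
    return total


def run_unaliased(memory, hot_addr, stores):
    """New sequence: the data is not kept in a register; reload every time."""
    total = 0
    for addr, value in stores:
        memory[addr] = value
        total += memory[hot_addr]
    return total


def execute(memory, hot_addr, stores):
    snapshot = dict(memory)               # assumption: state can be restored
    try:
        return run_aliased(memory, hot_addr, stores)
    except AliasException:
        memory.clear()
        memory.update(snapshot)
        return run_unaliased(memory, hot_addr, stores)
```

When no store touches the aliased address, only the optimized sequence runs. When one does, the conservative sequence produces the result the original program order requires.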

20. A microprocessor comprising: 

a host processor capable of executing a first instruction 
set, 

code morphing software for translating programs written 
for a target processor having a second different instruc- 
tion set into instructions of the first instruction set for 
execution by the host processor, and 
a memory controller comprising:
a first execution unit register for storing memory data
which is frequently utilized by a processing unit in
executing a code sequence, 
a second execution unit register for holding a memory 
address of memory data stored in the first register, 
and 

optimizing means for selecting data frequently utilized 
by a processing unit to be stored in the first register 
while executing the code sequence. 
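
One way to picture the optimizing means of claim 20 is as a pass that counts how often each memory address is referenced in a code sequence and picks the most frequent one for the first register. The claim does not specify a selection policy, so the sketch below is an assumption throughout: `select_alias_candidate` and the `min_uses` threshold are hypothetical.

```python
from collections import Counter


def select_alias_candidate(accesses, min_uses=2):
    """Return the most frequently accessed address, or None if none qualifies.

    accesses: iterable of memory addresses referenced by one code sequence
    min_uses: hypothetical threshold below which aliasing is not worthwhile
    """
    counts = Counter(accesses)
    if not counts:
        return None
    address, uses = counts.most_common(1)[0]
    return address if uses >= min_uses else None
```

For a sequence that references addresses `0x10, 0x20, 0x10, 0x30, 0x10`, the function returns `0x10`, the address whose data would be placed in the first register.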

21. A microprocessor as claimed in claim 20 which further 
comprises means for assuring that data stored in the first 
register and at the memory address remain consistent. 

22. A microprocessor as claimed in claim 21 in which the 
means for assuring that data stored in the first register and at 
the memory address remain consistent comprises a com-
parator for comparing addresses of memory accesses with a
memory address held in the second register and generating 
an exception when addresses compare. 

23. A microprocessor as claimed in claim 22 in which the
means for assuring that data stored in the first register and at 
the memory address remain consistent further comprises 
software implemented means responsive to an exception 
generated by the comparator for replacing stale data with 
valid data being written. 

24. A microprocessor as claimed in claim 22 further 
comprising means responding to an exception taken during 
a write access of a memory address for updating data stored 
in the first register with data being written to the memory 
address. 

25. A microprocessor as claimed in claim 22 further 
comprising means responding to an exception taken during 
a read access of a memory address for updating data stored 
at the memory address with data stored in the first register. 

26. A microprocessor comprising: 

a host processor capable of executing a first instruction
set, 

code morphing software for translating programs written 
for a target processor having a second different instruc- 
tion set into instructions of the first instruction set for 
execution by the host processor, and
a memory controller comprising: 

a first register for storing memory data which is fre- 
quently utilized by a processing unit in executing a 
code sequence, 
a second register for holding a memory address of
memory data stored in the first register, and
optimizing means for selecting data frequently utilized
by a processing unit to be stored in the first register 
while executing the code sequence, and 
means for assuring that data stored in the first register 
and at the memory address remain consistent com- 
prising: 

a comparator for comparing addresses of memory 
accesses with a memory address held in the sec- 
ond register and generating an exception when 
addresses compare, and 

software implemented means responsive to an 
exception generated by the comparator for retrans- 
lating into a new code sequence without storing 
memory data in the first register which is fre- 
quently utilized by the host processor during a 
code sequence and executing the new code 
sequence. 



27. A memory controller comprising:

a first register in a processing unit for storing memory data 
which is frequently utilized by a processing unit during 
execution of a code sequence, 

a second register in a processing unit for storing a memory 
address of memory data stored in the first register, 

means for selecting data frequently utilized by a process- 
ing unit during execution of a code sequence to be 
stored in the second register, and 

means for assuring that data stored in the first register and 
at the memory address remain consistent. 

28. A memory controller as claimed in claim 27 in which 
the means for assuring that data stored in the first register 
and at the memory address remain consistent comprises: 

a comparator for comparing addresses of memory 
accesses with a memory address in the second register 
and generating an exception when addresses compare. 

29. A memory controller as claimed in claim 28 in which 
the means for assuring that data stored in the first register 
and at the memory address remain consistent further com- 
prises means responsive to an exception generated by the 
comparator for replacing stale data with valid data being 
written. 

30. A memory controller as claimed in claim 28 further 
comprising means responding to an exception taken during 
a write access of a memory address for updating data stored 
in the first register with data being written to the
memory address. 

31. A memory controller as claimed in claim 28 further 
comprising means responding to an exception taken during 
a read access of a memory address for updating data stored 
at the memory address with data stored in the first register. 

32. A memory controller comprising: 

a first register for storing memory data which is frequently 
utilized by a processing unit during execution of a code 
sequence, 

a second register for storing a memory address of memory 
data stored in the first register, 

means for selecting data frequently utilized by a process- 
ing unit during execution of a code sequence to be 
stored in the second register, and 

means for assuring that data stored in the first register and 
at the memory address remain consistent. 